Content

omniread.core.content

Summary

Canonical content models for OmniRead.

This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.

The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.

Classes

Content dataclass

Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with
  minimal contextual metadata describing its origin and type.
- This class is the primary exchange format between scrapers,
  parsers, and downstream consumers.

Attributes
content_type class-attribute instance-attribute
content_type: Optional[ContentType] = None

Media (MIME) type of the content as a `ContentType` member, or `None` if the type is not known.

metadata class-attribute instance-attribute
metadata: Optional[Mapping[str, Any]] = None

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

raw instance-attribute
raw: bytes

Raw content bytes as retrieved from the source.

source instance-attribute
source: str

Identifier of the content origin (URL, file path, or logical name).
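
The following is a minimal sketch of how a `Content` instance might be constructed from the attributes documented above. To keep it self-contained, it defines stand-in `Content` and `ContentType` classes that mirror the documented signature rather than importing from `omniread.core.content`; the real classes may carry options not shown here. The URL, the logical name `upload-42`, and the `encoding` metadata key are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    # Stand-in mirroring the enum values documented on this page.
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


@dataclass
class Content:
    # Stand-in mirroring the documented signature; the real class lives in
    # omniread.core.content and may differ in details not shown on this page.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


# A fully populated instance, as a scraper might produce it.
page = Content(
    raw=b"<html><body>Hello</body></html>",
    source="https://example.com/index.html",  # hypothetical URL
    content_type=ContentType.HTML,
    metadata={"encoding": "utf-8"},  # hypothetical encoding hint
)

# Only raw and source are required; the optional fields default to None.
blob = Content(raw=b"\x00\x01", source="upload-42")  # hypothetical logical name

# Consumers decide how to decode raw; here the hint comes from metadata.
text = page.raw.decode(page.metadata["encoding"])
```

Because `raw` is kept as bytes, decoding stays with the consumer, which fits the stated goal of describing what was extracted rather than how it was parsed.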

ContentType

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the
  content source.
- It is primarily used for routing content to the appropriate
  parser or downstream consumer.

Attributes
HTML class-attribute instance-attribute
HTML = 'text/html'

HTML document content.

JSON class-attribute instance-attribute
JSON = 'application/json'

JSON document content.

PDF class-attribute instance-attribute
PDF = 'application/pdf'

PDF document content.

XML class-attribute instance-attribute
XML = 'application/xml'

XML document content.
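
Since `ContentType` derives from both `str` and `Enum`, its members compare equal to their MIME strings and can be looked up by value. The sketch below illustrates one way this could be used for routing. The enum is redefined locally as a stand-in, and `infer_content_type` and `PARSER_NAMES` are hypothetical names introduced for illustration only, not part of OmniRead.

```python
from enum import Enum
from typing import Dict, Optional


class ContentType(str, Enum):
    # Stand-in mirroring the enum values documented on this page.
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def infer_content_type(header: Optional[str]) -> Optional[ContentType]:
    """Hypothetical helper: map a Content-Type header value to a member."""
    if not header:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = header.split(";", 1)[0].strip().lower()
    try:
        return ContentType(mime)  # Enum lookup by value
    except ValueError:
        return None  # media type not covered by the enum


# Hypothetical routing table keyed by enum member.
PARSER_NAMES: Dict[ContentType, str] = {
    ContentType.HTML: "html-parser",
    ContentType.JSON: "json-parser",
    ContentType.PDF: "pdf-parser",
    ContentType.XML: "xml-parser",
}

detected = infer_content_type("text/html; charset=utf-8")
parser_name = PARSER_NAMES.get(detected)
```

Returning `None` for unrecognized media types matches the optional `content_type` field on `Content`, so callers can store the result directly.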