Core

omniread.core

Core domain contracts for OmniRead.

Summary

This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).

Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.

Submodules:

- content: Canonical content models and enums
- parser: Abstract parsing contracts
- scraper: Abstract scraping contracts

Format-specific behavior must not be introduced at this layer.

Public API

Content
ContentType

Classes

BaseParser

BaseParser(content: Content)

Bases: ABC, Generic[T]

Base interface for all parsers.

Notes

Guarantees:

- A parser is a self-contained object that owns the Content it is responsible for interpreting
- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`

Responsibilities:

- Implementations must declare supported content types via `supported_types`
- Implementations must report failures by raising parsing-specific exceptions from `parse()`
- Implementations must remain deterministic for a given input

Initialize the parser with content to be parsed.

Parameters:

Name	Type	Description	Default
`content`	`Content`	Content instance to be parsed.	required

Raises:

Type	Description
`ValueError`	If the content type is not supported by this parser.

Attributes

supported_types `class-attribute` `instance-attribute`

supported_types: Set[ContentType] = set()

Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.

Functions

parse `abstractmethod`

parse() -> T

Parse the owned content into structured output.

Returns:

Name	Type	Description
`T`	`T`	Parsed, structured representation.

Raises:

Type	Description
`Exception`	Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully consume the provided content and return a deterministic, structured output
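
The contract above can be illustrated with a minimal sketch of a concrete parser. The `ContentType`, `Content`, and `BaseParser` definitions below are stand-ins that mirror the documented signatures so the snippet runs without OmniRead installed; real code would import them from `omniread.core`. The `self.content` attribute name, `JsonDictParser`, and `JsonParseError` are hypothetical names chosen for illustration and are not part of the documented API.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Generic, Mapping, Optional, Set, TypeVar

T = TypeVar("T")


# --- Stand-ins mirroring the documented signatures (assumptions) ---
class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseParser(ABC, Generic[T]):
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content):
        # Attribute name is an assumption; the docs only say the parser
        # "owns" its Content.
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        # An empty set means the parser is content-type agnostic.
        if not self.supported_types:
            return True
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T: ...


# --- Hypothetical concrete parser ---
class JsonParseError(Exception):
    """Parsing-specific error for this illustrative parser."""


class JsonDictParser(BaseParser[dict]):
    supported_types = {ContentType.JSON}

    def parse(self) -> dict:
        try:
            data = json.loads(self.content.raw.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            raise JsonParseError(f"Invalid JSON from {self.content.source}") from exc
        if not isinstance(data, dict):
            raise JsonParseError("Expected a JSON object at the top level")
        return data


content = Content(
    raw=b'{"title": "Hello"}',
    source="memory://example",
    content_type=ContentType.JSON,
)
parsed = JsonDictParser(content).parse()
```

Because the output depends only on `content.raw`, calling `parse()` twice on the same input yields equal results, which is what the determinism requirement asks for.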

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

Name	Type	Description
`bool`	`bool`	True if the content type is supported; False otherwise.
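
The support check, combined with the empty-set rule documented for `supported_types`, might behave like the standalone sketch below. `is_supported` is a hypothetical helper and not an OmniRead function, and `ContentType` is a stand-in carrying the documented values. Treating an unknown (`None`) content type as unsupported by a restricted parser is an assumption, since the documentation does not specify that case.

```python
from enum import Enum
from typing import Optional, Set


class ContentType(str, Enum):
    # Stand-in with the values documented for omniread.core.ContentType.
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def is_supported(
    supported_types: Set[ContentType],
    content_type: Optional[ContentType],
) -> bool:
    """Hypothetical helper mirroring the documented supports() semantics."""
    # An empty set means the parser is content-type agnostic.
    if not supported_types:
        return True
    # Assumption: None (unknown type) is rejected by a restricted parser.
    return content_type in supported_types


html_only = {ContentType.HTML}
agnostic: Set[ContentType] = set()
```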

BaseScraper

Bases: ABC

Base interface for all scrapers.

Notes

Responsibilities:

- A scraper is responsible only for fetching raw content (bytes) from a source; it must not interpret or parse that content
- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object
- Scrapers define how content is obtained, not what the content means
- Implementations may vary in transport mechanism, authentication strategy, and retry and backoff behavior

Constraints:

- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser

Functions

fetch `abstractmethod`

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch raw content from the given source.

Parameters:

Name	Type	Description	Default
`source`	`str`	Location identifier (URL, file path, S3 URI, etc.)	required
`metadata`	`Optional[Mapping[str, Any]]`	Optional hints for the scraper (headers, auth, etc.)	`None`

Returns:

Name	Type	Description
`Content`	`Content`	Content object containing raw bytes and metadata.

Raises:

Type	Description
`Exception`	Retrieval-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object
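
As a rough illustration of the `fetch` contract, the sketch below reads a local file and wraps its bytes in a `Content` object without interpreting them. `LocalFileScraper` is a hypothetical implementation, and `ContentType`, `Content`, and `BaseScraper` are stand-ins that mirror the documented signatures so the snippet runs without OmniRead installed. This sketch ignores the `metadata` hints and lets `OSError` propagate as its retrieval-specific error; both are choices made for brevity, not documented behavior.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any, Mapping, Optional


# --- Stand-ins mirroring the documented signatures (assumptions) ---
class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseScraper(ABC):
    @abstractmethod
    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content: ...


# --- Hypothetical concrete scraper ---
class LocalFileScraper(BaseScraper):
    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        # Retrieval only: the bytes are returned untouched and no attempt
        # is made to parse them. A missing file raises OSError.
        raw = Path(source).read_bytes()
        return Content(raw=raw, source=source)


# Usage: fetch a temporary file and inspect the resulting Content.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_bytes(b"hello")
    content = LocalFileScraper().fetch(str(path))
```

The scraper holds no state between calls, so the same instance can serve any number of sources.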

Content `dataclass`

Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type
- This class is the primary exchange format between scrapers, parsers, and downstream consumers

Attributes

content_type `class-attribute` `instance-attribute`

content_type: Optional[ContentType] = None

Optional MIME type of the content, if known.

metadata `class-attribute` `instance-attribute`

metadata: Optional[Mapping[str, Any]] = None

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

raw `instance-attribute`

raw: bytes

Raw content bytes as retrieved from the source.

source `instance-attribute`

source: str

Identifier of the content origin (URL, file path, or logical name).
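
A short sketch of constructing `Content` values follows. The dataclass and enum are stand-ins reproducing the documented fields and defaults so the snippet runs on its own; the URLs, payloads, and metadata keys are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    # Stand-in with the values documented for omniread.core.ContentType.
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


@dataclass
class Content:
    # Stand-in reproducing the documented fields, in signature order.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


# Fully described payload: type and metadata are known.
page = Content(
    raw=b"<html><body>Hello</body></html>",
    source="https://example.com/index.html",
    content_type=ContentType.HTML,
    metadata={"encoding": "utf-8"},
)

# Only raw and source are required; the other fields default to None.
blob = Content(raw=b"\x00\x01", source="upload-42")

# Decoding is left to the consumer, here using the encoding hint.
text = page.raw.decode(page.metadata["encoding"])
```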

ContentType

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the content source
- It is primarily used for routing content to the appropriate parser or downstream consumer

Attributes

HTML `class-attribute` `instance-attribute`

HTML = 'text/html'

HTML document content.

JSON `class-attribute` `instance-attribute`

JSON = 'application/json'

JSON document content.

PDF `class-attribute` `instance-attribute`

PDF = 'application/pdf'

PDF document content.

XML `class-attribute` `instance-attribute`

XML = 'application/xml'

XML document content.
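
Because `ContentType` derives from both `str` and `Enum`, its members compare equal to their MIME strings and can be looked up by value. The sketch below uses a stand-in enum with the documented values; `content_type_from_header` is a hypothetical helper, not an OmniRead function, showing one way a declared media type could be mapped onto the enum.

```python
from enum import Enum
from typing import Optional


class ContentType(str, Enum):
    # Stand-in with the values documented for omniread.core.ContentType.
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def content_type_from_header(header: str) -> Optional[ContentType]:
    """Hypothetical helper: map a Content-Type header to a ContentType."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = header.split(";", 1)[0].strip().lower()
    try:
        return ContentType(mime)  # lookup by value
    except ValueError:
        return None  # media type not covered by the enum


html = content_type_from_header("text/html; charset=utf-8")
unknown = content_type_from_header("image/png")

# str mixin: a member compares equal to its plain MIME string.
matches_plain_string = ContentType.JSON == "application/json"
```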
