Core
omniread.core
Core domain contracts for OmniRead.
Summary
This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).
Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.
Submodules:
- content: Canonical content models and enums
- parser: Abstract parsing contracts
- scraper: Abstract scraping contracts
Format-specific behavior must not be introduced at this layer.
Public API
Classes
BaseParser
Bases: ABC, Generic[T]
Base interface for all parsers.
Notes
Guarantees:
Responsibilities:
Initialize the parser with content to be parsed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | Content instance to be parsed. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content type is not supported by this parser. |
Attributes
supported_types
class-attribute
instance-attribute
Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.
Functions
parse
abstractmethod
Parse the owned content into structured output.
Returns:
| Name | Type | Description |
|---|---|---|
| T | T | Parsed, structured representation. |
Raises:
| Type | Description |
|---|---|
| Exception | Parsing-specific errors as defined by the implementation. |
Notes
Responsibilities:
supports
Check whether this parser supports the content's type.
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the content type is supported; False otherwise. |
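The contract described above (constructor validation, `supported_types`, `supports`, and an abstract `parse`) can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's source: `Content` and `ContentType` are local stand-ins, the `raw` field name and the enum members are assumptions, and `TitleParser` is a hypothetical subclass.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Generic, Mapping, Optional, Set, TypeVar

T = TypeVar("T")


class ContentType(str, Enum):
    # Hypothetical members; the real enum's values are not listed on this page.
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass
class Content:
    # Stand-in for omniread.core.Content; the `raw` field name is an assumption.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseParser(ABC, Generic[T]):
    # Stand-in that mirrors the documented contract.
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type!r}")

    @abstractmethod
    def parse(self) -> T:
        ...

    def supports(self) -> bool:
        # An empty set means the parser is content-type agnostic.
        if not self.supported_types:
            return True
        return self.content.content_type in self.supported_types


class TitleParser(BaseParser[str]):
    """Hypothetical parser that extracts the <title> text from HTML bytes."""

    supported_types = {ContentType.HTML}

    def parse(self) -> str:
        text = self.content.raw.decode("utf-8", errors="replace")
        start = text.find("<title>")
        end = text.find("</title>")
        if start == -1 or end == -1:
            return ""
        return text[start + len("<title>"):end].strip()
```

In this sketch, constructing `TitleParser` with PDF content raises `ValueError` before `parse` is ever called, which matches the documented constructor behavior.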
BaseScraper
Bases: ABC
Base interface for all scrapers.
Notes
Responsibilities:
Constraints:
Functions
fetch
abstractmethod
Fetch raw content from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | Location identifier (URL, file path, S3 URI, etc.) | required |
| metadata | Optional[Mapping[str, Any]] | Optional hints for the scraper (headers, auth, etc.) | None |
Returns:
| Name | Type | Description |
|---|---|---|
| Content | Content | Content object containing raw bytes and metadata. |
Raises:
| Type | Description |
|---|---|
| Exception | Retrieval-specific errors as defined by the implementation. |
Notes
Responsibilities:
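A concrete scraper only needs to implement `fetch`. The sketch below is one possible shape, assuming a stand-in `Content` dataclass (the `raw` field name is an assumption) and a hypothetical `InMemoryScraper` that serves bytes from a dictionary. The choice of `LookupError` is illustrative; the documentation only says that retrieval errors are implementation-defined.

```python
import mimetypes
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


@dataclass
class Content:
    # Stand-in for omniread.core.Content; the `raw` field name is an assumption.
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseScraper(ABC):
    # Stand-in that mirrors the documented fetch signature.
    @abstractmethod
    def fetch(
        self,
        source: str,
        metadata: Optional[Mapping[str, Any]] = None,
    ) -> Content:
        ...


class InMemoryScraper(BaseScraper):
    """Hypothetical scraper backed by a dict, convenient for tests."""

    def __init__(self, store: Mapping[str, bytes]) -> None:
        self._store: Dict[str, bytes] = dict(store)

    def fetch(
        self,
        source: str,
        metadata: Optional[Mapping[str, Any]] = None,
    ) -> Content:
        try:
            raw = self._store[source]
        except KeyError:
            raise LookupError(f"No content registered for {source!r}") from None
        # Guess the MIME type from the file extension; None if unknown.
        guessed, _ = mimetypes.guess_type(source)
        return Content(
            raw=raw,
            source=source,
            content_type=guessed,
            metadata=dict(metadata) if metadata else None,
        )
```

The scraper returns raw bytes untouched and leaves interpretation to a parser, which keeps retrieval and parsing separable.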
Content
dataclass
Normalized representation of extracted content.
Notes
Responsibilities:
Attributes
content_type
class-attribute
instance-attribute
Optional MIME type of the content, if known.
metadata
class-attribute
instance-attribute
Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
source
instance-attribute
Identifier of the content origin (URL, file path, or logical name).
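As a rough illustration of how these attributes fit together, the following stand-in dataclass gives `content_type` and `metadata` defaults and requires `source`. The `raw` field name for the payload bytes is an assumption, and `content_type` is typed as a plain string here only to keep the sketch self-contained.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class Content:
    # Stand-in for illustration only; `raw` is an assumed field name.
    raw: bytes
    source: str
    content_type: Optional[str] = None  # MIME type, if known
    metadata: Optional[Mapping[str, Any]] = None


# Fully described content, e.g. as a scraper might return it.
page = Content(
    raw=b"<html><body>Hello</body></html>",
    source="https://example.com/index.html",
    content_type="text/html",
    metadata={"encoding": "utf-8"},
)

# Only the payload and the source are required in this sketch.
blob = Content(raw=b"\x00\x01", source="upload-42")
```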
ContentType
Bases: str, Enum
Supported MIME types for extracted content.
Notes
Guarantees:
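Because the enum derives from both `str` and `Enum`, its members compare equal to plain MIME strings and can be looked up by value. The sketch below shows that behavior with hypothetical members (`HTML`, `PDF`); the actual member list is not given here, and `detect` is an illustrative helper, not part of the package.

```python
from enum import Enum
from typing import Optional


class ContentType(str, Enum):
    # Hypothetical members chosen for illustration.
    HTML = "text/html"
    PDF = "application/pdf"


def detect(mime: str) -> Optional[ContentType]:
    """Map a raw MIME string to a ContentType, or None if it is not listed."""
    try:
        return ContentType(mime)
    except ValueError:
        return None
```

Looking members up by value lets a caller normalize a header such as `Content-Type` into the enum without a separate mapping table.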