{ "module": "omniread.core", "content": { "path": "omniread.core", "docstring": "Core domain contracts for OmniRead.\n\n---\n\n## Summary\n\nThis package defines the **format-agnostic domain layer** of OmniRead.\nIt exposes canonical content models and abstract interfaces that are\nimplemented by format-specific modules (HTML, PDF, etc.).\n\nPublic exports from this package are considered **stable contracts** and\nare safe for downstream consumers to depend on.\n\nSubmodules:\n- content: Canonical content models and enums\n- parser: Abstract parsing contracts\n- scraper: Abstract scraping contracts\n\nFormat-specific behavior must not be introduced at this layer.\n\n---\n\n## Public API\n\n Content\n ContentType\n\n---", "objects": { "Content": { "name": "Content", "kind": "class", "path": "omniread.core.Content", "signature": "", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "members": { "raw": { "name": "raw", "kind": "attribute", "path": "omniread.core.Content.raw", "signature": "", "docstring": "Raw content bytes as retrieved from the source." }, "source": { "name": "source", "kind": "attribute", "path": "omniread.core.Content.source", "signature": "", "docstring": "Identifier of the content origin (URL, file path, or logical name)." }, "content_type": { "name": "content_type", "kind": "attribute", "path": "omniread.core.Content.content_type", "signature": "", "docstring": "Optional MIME type of the content, if known." }, "metadata": { "name": "metadata", "kind": "attribute", "path": "omniread.core.Content.metadata", "signature": "", "docstring": "Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes)." } } }, "ContentType": { "name": "ContentType", "kind": "class", "path": "omniread.core.ContentType", "signature": "", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "members": { "HTML": { "name": "HTML", "kind": "attribute", "path": "omniread.core.ContentType.HTML", "signature": "", "docstring": "HTML document content." }, "PDF": { "name": "PDF", "kind": "attribute", "path": "omniread.core.ContentType.PDF", "signature": "", "docstring": "PDF document content." }, "JSON": { "name": "JSON", "kind": "attribute", "path": "omniread.core.ContentType.JSON", "signature": "", "docstring": "JSON document content." }, "XML": { "name": "XML", "kind": "attribute", "path": "omniread.core.ContentType.XML", "signature": "", "docstring": "XML document content." } } }, "BaseParser": { "name": "BaseParser", "kind": "class", "path": "omniread.core.BaseParser", "signature": "", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "members": { "supported_types": { "name": "supported_types", "kind": "attribute", "path": "omniread.core.BaseParser.supported_types", "signature": "", "docstring": "Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic." }, "content": { "name": "content", "kind": "attribute", "path": "omniread.core.BaseParser.content", "signature": "", "docstring": null }, "parse": { "name": "parse", "kind": "function", "path": "omniread.core.BaseParser.parse", "signature": "", "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" }, "supports": { "name": "supports", "kind": "function", "path": "omniread.core.BaseParser.supports", "signature": "", "docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise." } } }, "BaseScraper": { "name": "BaseScraper", "kind": "class", "path": "omniread.core.BaseScraper", "signature": "", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.core.BaseScraper.fetch", "signature": "", "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" } } }, "content": { "name": "content", "kind": "module", "path": "omniread.core.content", "signature": null, "docstring": "Canonical content models for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.", "members": { "Enum": { "name": "Enum", "kind": "alias", "path": "omniread.core.content.Enum", "signature": "", "docstring": null }, "dataclass": { "name": "dataclass", "kind": "alias", "path": "omniread.core.content.dataclass", "signature": "", "docstring": null }, "Any": { "name": "Any", "kind": "alias", "path": "omniread.core.content.Any", "signature": "", "docstring": null }, "Mapping": { "name": "Mapping", "kind": "alias", "path": "omniread.core.content.Mapping", "signature": "", "docstring": null }, "Optional": { "name": "Optional", "kind": "alias", "path": "omniread.core.content.Optional", "signature": "", "docstring": null }, "ContentType": { "name": "ContentType", "kind": "class", "path": "omniread.core.content.ContentType", "signature": "", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "members": { "HTML": { "name": "HTML", "kind": "attribute", "path": "omniread.core.content.ContentType.HTML", "signature": null, "docstring": "HTML document content." }, "PDF": { "name": "PDF", "kind": "attribute", "path": "omniread.core.content.ContentType.PDF", "signature": null, "docstring": "PDF document content." }, "JSON": { "name": "JSON", "kind": "attribute", "path": "omniread.core.content.ContentType.JSON", "signature": null, "docstring": "JSON document content." }, "XML": { "name": "XML", "kind": "attribute", "path": "omniread.core.content.ContentType.XML", "signature": null, "docstring": "XML document content." } } }, "Content": { "name": "Content", "kind": "class", "path": "omniread.core.content.Content", "signature": "", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "members": { "raw": { "name": "raw", "kind": "attribute", "path": "omniread.core.content.Content.raw", "signature": null, "docstring": "Raw content bytes as retrieved from the source." }, "source": { "name": "source", "kind": "attribute", "path": "omniread.core.content.Content.source", "signature": null, "docstring": "Identifier of the content origin (URL, file path, or logical name)." }, "content_type": { "name": "content_type", "kind": "attribute", "path": "omniread.core.content.Content.content_type", "signature": null, "docstring": "Optional MIME type of the content, if known." }, "metadata": { "name": "metadata", "kind": "attribute", "path": "omniread.core.content.Content.metadata", "signature": null, "docstring": "Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes)." } } } } }, "parser": { "name": "parser", "kind": "module", "path": "omniread.core.parser", "signature": null, "docstring": "Abstract parsing contracts for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources", "members": { "ABC": { "name": "ABC", "kind": "alias", "path": "omniread.core.parser.ABC", "signature": "", "docstring": null }, "abstractmethod": { "name": "abstractmethod", "kind": "alias", "path": "omniread.core.parser.abstractmethod", "signature": "", "docstring": null }, "Generic": { "name": "Generic", "kind": "alias", "path": "omniread.core.parser.Generic", "signature": "", "docstring": null }, "TypeVar": { "name": "TypeVar", "kind": "alias", "path": "omniread.core.parser.TypeVar", "signature": "", "docstring": null }, "Set": { "name": "Set", "kind": "alias", "path": "omniread.core.parser.Set", "signature": "", "docstring": null }, "Content": { "name": "Content", "kind": "class", "path": "omniread.core.parser.Content", "signature": "", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "members": { "raw": { "name": "raw", "kind": "attribute", "path": "omniread.core.parser.Content.raw", "signature": "", "docstring": "Raw content bytes as retrieved from the source." }, "source": { "name": "source", "kind": "attribute", "path": "omniread.core.parser.Content.source", "signature": "", "docstring": "Identifier of the content origin (URL, file path, or logical name)." }, "content_type": { "name": "content_type", "kind": "attribute", "path": "omniread.core.parser.Content.content_type", "signature": "", "docstring": "Optional MIME type of the content, if known." }, "metadata": { "name": "metadata", "kind": "attribute", "path": "omniread.core.parser.Content.metadata", "signature": "", "docstring": "Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes)." } } }, "ContentType": { "name": "ContentType", "kind": "class", "path": "omniread.core.parser.ContentType", "signature": "", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "members": { "HTML": { "name": "HTML", "kind": "attribute", "path": "omniread.core.parser.ContentType.HTML", "signature": "", "docstring": "HTML document content." }, "PDF": { "name": "PDF", "kind": "attribute", "path": "omniread.core.parser.ContentType.PDF", "signature": "", "docstring": "PDF document content." }, "JSON": { "name": "JSON", "kind": "attribute", "path": "omniread.core.parser.ContentType.JSON", "signature": "", "docstring": "JSON document content." }, "XML": { "name": "XML", "kind": "attribute", "path": "omniread.core.parser.ContentType.XML", "signature": "", "docstring": "XML document content." } } }, "T": { "name": "T", "kind": "attribute", "path": "omniread.core.parser.T", "signature": null, "docstring": null }, "BaseParser": { "name": "BaseParser", "kind": "class", "path": "omniread.core.parser.BaseParser", "signature": "", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "members": { "supported_types": { "name": "supported_types", "kind": "attribute", "path": "omniread.core.parser.BaseParser.supported_types", "signature": null, "docstring": "Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic." }, "content": { "name": "content", "kind": "attribute", "path": "omniread.core.parser.BaseParser.content", "signature": null, "docstring": null }, "parse": { "name": "parse", "kind": "function", "path": "omniread.core.parser.BaseParser.parse", "signature": "", "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" }, "supports": { "name": "supports", "kind": "function", "path": "omniread.core.parser.BaseParser.supports", "signature": "", "docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise." } } } } }, "scraper": { "name": "scraper", "kind": "module", "path": "omniread.core.scraper", "signature": null, "docstring": "Abstract scraping contracts for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.", "members": { "ABC": { "name": "ABC", "kind": "alias", "path": "omniread.core.scraper.ABC", "signature": "", "docstring": null }, "abstractmethod": { "name": "abstractmethod", "kind": "alias", "path": "omniread.core.scraper.abstractmethod", "signature": "", "docstring": null }, "Any": { "name": "Any", "kind": "alias", "path": "omniread.core.scraper.Any", "signature": "", "docstring": null }, "Mapping": { "name": "Mapping", "kind": "alias", "path": "omniread.core.scraper.Mapping", "signature": "", "docstring": null }, "Optional": { "name": "Optional", "kind": "alias", "path": "omniread.core.scraper.Optional", "signature": "", "docstring": null }, "Content": { "name": "Content", "kind": "class", "path": "omniread.core.scraper.Content", "signature": "", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "members": { "raw": { "name": "raw", "kind": "attribute", "path": "omniread.core.scraper.Content.raw", "signature": "", "docstring": "Raw content bytes as retrieved from the source." }, "source": { "name": "source", "kind": "attribute", "path": "omniread.core.scraper.Content.source", "signature": "", "docstring": "Identifier of the content origin (URL, file path, or logical name)." }, "content_type": { "name": "content_type", "kind": "attribute", "path": "omniread.core.scraper.Content.content_type", "signature": "", "docstring": "Optional MIME type of the content, if known." }, "metadata": { "name": "metadata", "kind": "attribute", "path": "omniread.core.scraper.Content.metadata", "signature": "", "docstring": "Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes)." } } }, "BaseScraper": { "name": "BaseScraper", "kind": "class", "path": "omniread.core.scraper.BaseScraper", "signature": "", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.core.scraper.BaseScraper.fetch", "signature": "", "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" } } } } } } } }