docs/libs/omniread/site/search/search_index.json
Vishesh 'ironeagle' Bangotra 9191de9dff
Build: 0.1.8
2026-03-08 01:01:39 +05:30

1 line
139 KiB
JSON

{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"omniread","text":"<ul> <li>Omniread</li> </ul>"},{"location":"#omniread","title":"omniread","text":"<p>OmniRead \u2014 format-agnostic content acquisition and parsing framework.</p>"},{"location":"#omniread--summary","title":"Summary","text":"<p>OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.</p> <p>The library is structured around three core concepts:</p> <ol> <li>Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.</li> <li>Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.</li> <li>Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.</li> </ol> <p>OmniRead deliberately separates these responsibilities to ensure: - Clear boundaries between IO and interpretation - Replaceable implementations per format - Predictable, testable behavior</p>"},{"location":"#omniread--installation","title":"Installation","text":"<p>Install OmniRead using pip:</p> <pre><code>pip install omniread\n</code></pre> <p>Or with Poetry:</p> <pre><code>poetry add omniread\n</code></pre>"},{"location":"#omniread--quick-start","title":"Quick start","text":"<p>HTML example:</p> <pre><code>from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -&gt; str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n</code></pre> <p>PDF example:</p> <pre><code>from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = 
PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -&gt; str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n</code></pre>"},{"location":"#omniread--public-api","title":"Public API","text":"<p>This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.</p> <p>Core: - Content - ContentType</p> <p>HTML: - HTMLScraper - HTMLParser</p> <p>PDF: - FileSystemPDFClient - PDFScraper - PDFParser</p> <p>Core Philosophy: <code>OmniRead</code> is designed as a decoupled content engine: 1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other. 2. Normalized Exchange: All components communicate via the <code>Content</code> model, ensuring a consistent contract. 3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.</p>"},{"location":"#omniread-classes","title":"Classes","text":""},{"location":"#omniread.Content","title":"Content <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers\n</code></pre>"},{"location":"#omniread.Content-attributes","title":"Attributes","text":""},{"location":"#omniread.Content.content_type","title":"content_type <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> 
<p>Optional MIME type of the content, if known.</p>"},{"location":"#omniread.Content.metadata","title":"metadata <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"#omniread.Content.raw","title":"raw <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"#omniread.Content.source","title":"source <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier of the content origin (URL, file path, or logical name).</p>"},{"location":"#omniread.ContentType","title":"ContentType","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n</code></pre>"},{"location":"#omniread.ContentType-attributes","title":"Attributes","text":""},{"location":"#omniread.ContentType.HTML","title":"HTML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"#omniread.ContentType.JSON","title":"JSON <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"#omniread.ContentType.PDF","title":"PDF <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"#omniread.ContentType.XML","title":"XML <code>class-attribute</code> 
<code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p> Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns their raw binary contents\n</code></pre>"},{"location":"#omniread.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"#omniread.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Provides reusable helpers for HTML extraction. 
Concrete parsers must explicitly define the return type\n</code></pre> <p>Guarantees:</p> <pre><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"#omniread.HTMLParser-attributes","title":"Attributes","text":""},{"location":"#omniread.HTMLParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"#omniread.HTMLParser-functions","title":"Functions","text":""},{"location":"#omniread.HTMLParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n</code></pre>"},{"location":"#omniread.HTMLParser.parse_div","title":"parse_div <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name 
Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"#omniread.HTMLParser.parse_link","title":"parse_link <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"#omniread.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document\n- This includes: Document title, `&lt;meta&gt;` tag name/property \u2192 content mappings\n</code></pre>"},{"location":"#omniread.HTMLParser.parse_table","title":"parse_table <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>list[list[str]]: A list of rows, where each row is a list of cell text 
values.</p>"},{"location":"#omniread.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"#omniread.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses\n</code></pre> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. 
If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"#omniread.HTMLScraper-functions","title":"Functions","text":""},{"location":"#omniread.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML content.</p>"},{"location":"#omniread.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p> 
Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"#omniread.PDFParser-attributes","title":"Attributes","text":""},{"location":"#omniread.PDFParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"#omniread.PDFParser-functions","title":"Functions","text":""},{"location":"#omniread.PDFParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n</code></pre>"},{"location":"#omniread.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> 
<code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"#omniread.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"#omniread.PDFScraper-functions","title":"Functions","text":""},{"location":"#omniread.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"},{"location":"core/","title":"Core","text":""},{"location":"core/#omniread.core","title":"omniread.core","text":"<p>Core domain contracts for OmniRead.</p>"},{"location":"core/#omniread.core--summary","title":"Summary","text":"<p>This package defines the 
format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).</p> <p>Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.</p> <p>Submodules: - content: Canonical content models and enums - parser: Abstract parsing contracts - scraper: Abstract scraping contracts</p> <p>Format-specific behavior must not be introduced at this layer.</p>"},{"location":"core/#omniread.core--public-api","title":"Public API","text":"<pre><code>Content\nContentType\n</code></pre>"},{"location":"core/#omniread.core-classes","title":"Classes","text":""},{"location":"core/#omniread.core.BaseParser","title":"BaseParser","text":"<pre><code>BaseParser(content: Content)\n</code></pre> <p> Bases: <code>ABC</code>, <code>Generic[T]</code></p> <p>Base interface for all parsers.</p> Notes <p>Guarantees:</p> <pre><code>- A parser is a self-contained object that owns the Content it is responsible for interpreting\n- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n</code></pre> <p>Responsibilities:</p> <pre><code>- Implementations must declare supported content types via `supported_types`\n- Implementations must raise parsing-specific exceptions from `parse()`\n- Implementations must remain deterministic for a given input\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"core/#omniread.core.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.BaseParser.supported_types","title":"supported_types <code>class-attribute</code> 
<code>instance-attribute</code>","text":"<pre><code>supported_types: Set[ContentType] = set()\n</code></pre> <p>Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.</p>"},{"location":"core/#omniread.core.BaseParser-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse the owned content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed, structured representation.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully consume the provided content and return a deterministic, structured output\n</code></pre>"},{"location":"core/#omniread.core.BaseParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"core/#omniread.core.BaseScraper","title":"BaseScraper","text":"<p> Bases: <code>ABC</code></p> <p>Base interface for all scrapers.</p> Notes <p>Responsibilities:</p> <pre><code>- A scraper is responsible ONLY for fetching raw content (bytes) from a source. 
It must not interpret or parse it\n- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n- Scrapers define how content is obtained, not what the content means\n- Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n</code></pre> <p>Constraints:</p> <pre><code>- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser\n</code></pre>"},{"location":"core/#omniread.core.BaseScraper-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseScraper.fetch","title":"fetch <code>abstractmethod</code>","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch raw content from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>Location identifier (URL, file path, S3 URI, etc.)</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional hints for the scraper (headers, auth, etc.)</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>Content object containing raw bytes and metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object\n</code></pre>"},{"location":"core/#omniread.core.Content","title":"Content <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw content payload along with 
minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers\n</code></pre>"},{"location":"core/#omniread.core.Content-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.Content.content_type","title":"content_type <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> <p>Optional MIME type of the content, if known.</p>"},{"location":"core/#omniread.core.Content.metadata","title":"metadata <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"core/#omniread.core.Content.raw","title":"raw <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"core/#omniread.core.Content.source","title":"source <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier of the content origin (URL, file path, or logical name).</p>"},{"location":"core/#omniread.core.ContentType","title":"ContentType","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n</code></pre>"},{"location":"core/#omniread.core.ContentType-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.ContentType.HTML","title":"HTML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document 
content.</p>"},{"location":"core/#omniread.core.ContentType.JSON","title":"JSON <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"core/#omniread.core.ContentType.PDF","title":"PDF <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"core/#omniread.core.ContentType.XML","title":"XML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"core/content/","title":"Content","text":""},{"location":"core/content/#omniread.core.content","title":"omniread.core.content","text":"<p>Canonical content models for OmniRead.</p>"},{"location":"core/content/#omniread.core.content--summary","title":"Summary","text":"<p>This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.</p> <p>The models defined here represent what was extracted, not how it was retrieved or parsed. 
Format-specific behavior and metadata must not alter the semantic meaning of these models.</p>"},{"location":"core/content/#omniread.core.content-classes","title":"Classes","text":""},{"location":"core/content/#omniread.core.content.Content","title":"Content <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers\n</code></pre>"},{"location":"core/content/#omniread.core.content.Content-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.Content.content_type","title":"content_type <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> <p>Optional MIME type of the content, if known.</p>"},{"location":"core/content/#omniread.core.content.Content.metadata","title":"metadata <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"core/content/#omniread.core.content.Content.raw","title":"raw <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"core/content/#omniread.core.content.Content.source","title":"source <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier of the content origin (URL, file path, or logical 
name).</p>"},{"location":"core/content/#omniread.core.content.ContentType","title":"ContentType","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n</code></pre>"},{"location":"core/content/#omniread.core.content.ContentType-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.ContentType.HTML","title":"HTML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"core/content/#omniread.core.content.ContentType.JSON","title":"JSON <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"core/content/#omniread.core.content.ContentType.PDF","title":"PDF <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"core/content/#omniread.core.content.ContentType.XML","title":"XML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"core/parser/","title":"Parser","text":""},{"location":"core/parser/#omniread.core.parser","title":"omniread.core.parser","text":"<p>Abstract parsing contracts for OmniRead.</p>"},{"location":"core/parser/#omniread.core.parser--summary","title":"Summary","text":"<p>This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.</p> <p>Parsers are responsible for: - Interpreting a single <code>Content</code> instance - Validating compatibility with the 
content type - Producing a structured output suitable for downstream consumers</p> <p>Parsers are not responsible for: - Fetching or acquiring content - Performing retries or error recovery - Managing multiple content sources</p>"},{"location":"core/parser/#omniread.core.parser-classes","title":"Classes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"<pre><code>BaseParser(content: Content)\n</code></pre> <p> Bases: <code>ABC</code>, <code>Generic[T]</code></p> <p>Base interface for all parsers.</p> Notes <p>Guarantees:</p> <pre><code>- A parser is a self-contained object that owns the Content it is responsible for interpreting\n- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n</code></pre> <p>Responsibilities:</p> <pre><code>- Implementations must declare supported content types via `supported_types`\n- Implementations must raise parsing-specific exceptions from `parse()`\n- Implementations must remain deterministic for a given input\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"core/parser/#omniread.core.parser.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: Set[ContentType] = set()\n</code></pre> <p>Set of content types supported by this parser. 
An empty set indicates that the parser is content-type agnostic.</p>"},{"location":"core/parser/#omniread.core.parser.BaseParser-functions","title":"Functions","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse the owned content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed, structured representation.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully consume the provided content and return a deterministic, structured output\n</code></pre>"},{"location":"core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"core/scraper/","title":"Scraper","text":""},{"location":"core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":"<p>Abstract scraping contracts for OmniRead.</p>"},{"location":"core/scraper/#omniread.core.scraper--summary","title":"Summary","text":"<p>This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.</p> <p>Scrapers are responsible for: - Locating and retrieving raw content bytes - Attaching minimal contextual metadata - Returning normalized <code>Content</code> objects</p> <p>Scrapers are explicitly NOT responsible for: - Parsing or interpreting content - Inferring structure or semantics - Performing content-type specific processing</p> <p>All interpretation must be delegated to 
parsers.</p>"},{"location":"core/scraper/#omniread.core.scraper-classes","title":"Classes","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":"<p> Bases: <code>ABC</code></p> <p>Base interface for all scrapers.</p> Notes <p>Responsibilities:</p> <pre><code>- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n- Scrapers define how content is obtained, not what the content means\n- Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n</code></pre> <p>Constraints:</p> <pre><code>- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser\n</code></pre>"},{"location":"core/scraper/#omniread.core.scraper.BaseScraper-functions","title":"Functions","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch <code>abstractmethod</code>","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch raw content from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>Location identifier (URL, file path, S3 URI, etc.)</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional hints for the scraper (headers, auth, etc.)</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>Content object containing raw bytes and metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` 
object\n</code></pre>"},{"location":"html/","title":"Html","text":""},{"location":"html/#omniread.html","title":"omniread.html","text":"<p>HTML format implementation for OmniRead.</p>"},{"location":"html/#omniread.html--summary","title":"Summary","text":"<p>This package provides HTML-specific implementations of the core OmniRead contracts defined in <code>omniread.core</code>.</p> <p>It includes: - HTML parsers that interpret HTML content - HTML scrapers that retrieve HTML documents</p> <p>This package: - Implements, but does not redefine, core contracts - May contain HTML-specific behavior and edge-case handling - Produces canonical content models defined in <code>omniread.core.content</code></p> <p>Consumers should depend on <code>omniread.core</code> interfaces wherever possible and use this package only when HTML-specific behavior is required.</p>"},{"location":"html/#omniread.html--public-api","title":"Public API","text":"<pre><code>HTMLScraper\nHTMLParser\n</code></pre>"},{"location":"html/#omniread.html-classes","title":"Classes","text":""},{"location":"html/#omniread.html.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Provides reusable helpers for HTML extraction. 
Concrete parsers must explicitly define the return type\n</code></pre> <p>Guarantees:</p> <pre><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"html/#omniread.html.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/#omniread.html.HTMLParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"html/#omniread.html.HTMLParser-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n</code></pre>"},{"location":"html/#omniread.html.HTMLParser.parse_div","title":"parse_div <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a 
<code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"html/#omniread.html.HTMLParser.parse_link","title":"parse_link <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"html/#omniread.html.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document\n- This includes: Document title, `&lt;meta&gt;` tag name/property \u2192 content mappings\n</code></pre>"},{"location":"html/#omniread.html.HTMLParser.parse_table","title":"parse_table <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>list[list[str]]: A list of rows, 
where each row is a list of cell text values.</p>"},{"location":"html/#omniread.html.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses\n</code></pre> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. 
If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"html/#omniread.html.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML 
content.</p>"},{"location":"html/parser/","title":"Parser","text":""},{"location":"html/parser/#omniread.html.parser","title":"omniread.html.parser","text":"<p>HTML parser base implementations for OmniRead.</p>"},{"location":"html/parser/#omniread.html.parser--summary","title":"Summary","text":"<p>This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in <code>omniread.core.parser</code>.</p> <p>It supplies: - Content-type enforcement for HTML inputs - BeautifulSoup initialization and lifecycle management - Common helper methods for extracting structured data from HTML elements</p> <p>Concrete parsers must subclass <code>HTMLParser</code> and implement the <code>parse()</code> method to return a structured representation appropriate for their use case.</p>"},{"location":"html/parser/#omniread.html.parser-classes","title":"Classes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Provides reusable helpers for HTML extraction. 
Concrete parsers must explicitly define the return type\n</code></pre> <p>Guarantees:</p> <pre><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser-functions","title":"Functions","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n</code></pre>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') 
-&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document\n- This includes: Document title, `&lt;meta&gt;` tag name/property \u2192 content mappings\n</code></pre>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> 
required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>list[list[str]]: A list of rows, where each row is a list of cell text values.</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"html/scraper/","title":"Scraper","text":""},{"location":"html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":"<p>HTML scraping implementation for OmniRead.</p>"},{"location":"html/scraper/#omniread.html.scraper--summary","title":"Summary","text":"<p>This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core <code>BaseScraper</code> contract using <code>httpx</code> as the transport layer.</p> <p>This scraper is responsible for: - Fetching raw HTML bytes over HTTP(S) - Validating response content type - Attaching HTTP metadata to the returned content</p> <p>This scraper is not responsible for: - Parsing or interpreting HTML - Retrying failed requests - Managing crawl policies or rate limiting</p>"},{"location":"html/scraper/#omniread.html.scraper-classes","title":"Classes","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. 
The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses\n</code></pre> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: 
httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML content.</p>"},{"location":"omniread/","title":"Omniread","text":"<ul> <li>Core</li> <li>Html</li> <li>Pdf</li> </ul>"},{"location":"omniread/#omniread","title":"omniread","text":"<p>OmniRead \u2014 format-agnostic content acquisition and parsing framework.</p>"},{"location":"omniread/#omniread--summary","title":"Summary","text":"<p>OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.</p> <p>The library is structured around three core concepts:</p> <ol> <li>Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.</li> <li>Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). 
Scrapers never interpret content.</li> <li>Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.</li> </ol> <p>OmniRead deliberately separates these responsibilities to ensure: - Clear boundaries between IO and interpretation - Replaceable implementations per format - Predictable, testable behavior</p>"},{"location":"omniread/#omniread--installation","title":"Installation","text":"<p>Install OmniRead using pip:</p> <pre><code>pip install omniread\n</code></pre> <p>Or with Poetry:</p> <pre><code>poetry add omniread\n</code></pre>"},{"location":"omniread/#omniread--quick-start","title":"Quick start","text":"<p>HTML example:</p> <pre><code>from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -&gt; str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n</code></pre> <p>PDF example:</p> <pre><code>from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -&gt; str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n</code></pre>"},{"location":"omniread/#omniread--public-api","title":"Public API","text":"<p>This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.</p> <p>Core: - Content - ContentType</p> <p>HTML: - HTMLScraper - HTMLParser</p> <p>PDF: - FileSystemPDFClient - PDFScraper - PDFParser</p> <p>Core Philosophy: <code>OmniRead</code> is designed as a decoupled content engine: 1. 
Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other. 2. Normalized Exchange: All components communicate via the <code>Content</code> model, ensuring a consistent contract. 3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.</p>"},{"location":"omniread/#omniread-classes","title":"Classes","text":""},{"location":"omniread/#omniread.Content","title":"Content <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers\n</code></pre>"},{"location":"omniread/#omniread.Content-attributes","title":"Attributes","text":""},{"location":"omniread/#omniread.Content.content_type","title":"content_type <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> <p>Optional MIME type of the content, if known.</p>"},{"location":"omniread/#omniread.Content.metadata","title":"metadata <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"omniread/#omniread.Content.raw","title":"raw <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"omniread/#omniread.Content.source","title":"source <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier 
of the content origin (URL, file path, or logical name).</p>"},{"location":"omniread/#omniread.ContentType","title":"ContentType","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n</code></pre>"},{"location":"omniread/#omniread.ContentType-attributes","title":"Attributes","text":""},{"location":"omniread/#omniread.ContentType.HTML","title":"HTML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"omniread/#omniread.ContentType.JSON","title":"JSON <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"omniread/#omniread.ContentType.PDF","title":"PDF <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"omniread/#omniread.ContentType.XML","title":"XML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"omniread/#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p> Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns their raw binary contents\n</code></pre>"},{"location":"omniread/#omniread.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"omniread/#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from 
the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"omniread/#omniread.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n</code></pre> <p>Guarantees:</p> <pre><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"omniread/#omniread.HTMLParser-attributes","title":"Attributes","text":""},{"location":"omniread/#omniread.HTMLParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: 
set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"omniread/#omniread.HTMLParser-functions","title":"Functions","text":""},{"location":"omniread/#omniread.HTMLParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n</code></pre>"},{"location":"omniread/#omniread.HTMLParser.parse_div","title":"parse_div <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"omniread/#omniread.HTMLParser.parse_link","title":"parse_link <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"omniread/#omniread.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, 
Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document\n- This includes: Document title, `&lt;meta&gt;` tag name/property \u2192 content mappings\n</code></pre>"},{"location":"omniread/#omniread.HTMLParser.parse_table","title":"parse_table <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>list[list[str]]: A list of rows, where each row is a list of cell text values.</p>"},{"location":"omniread/#omniread.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/#omniread.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. 
The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses\n</code></pre> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"omniread/#omniread.HTMLScraper-functions","title":"Functions","text":""},{"location":"omniread/#omniread.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"omniread/#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate 
that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML content.</p>"},{"location":"omniread/#omniread.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/#omniread.PDFParser-attributes","title":"Attributes","text":""},{"location":"omniread/#omniread.PDFParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"omniread/#omniread.PDFParser-functions","title":"Functions","text":""},{"location":"omniread/#omniread.PDFParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> 
<p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n</code></pre>"},{"location":"omniread/#omniread.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/#omniread.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"omniread/#omniread.PDFScraper-functions","title":"Functions","text":""},{"location":"omniread/#omniread.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> 
Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"},{"location":"omniread/core/","title":"Core","text":"<ul> <li>Content</li> <li>Parser</li> <li>Scraper</li> </ul>"},{"location":"omniread/core/#omniread.core","title":"omniread.core","text":"<p>Core domain contracts for OmniRead.</p>"},{"location":"omniread/core/#omniread.core--summary","title":"Summary","text":"<p>This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).</p> <p>Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.</p> <p>Submodules: - content: Canonical content models and enums - parser: Abstract parsing contracts - scraper: Abstract scraping contracts</p> <p>Format-specific behavior must not be introduced at this layer.</p>"},{"location":"omniread/core/#omniread.core--public-api","title":"Public API","text":"<pre><code>Content\nContentType\n</code></pre>"},{"location":"omniread/core/#omniread.core-classes","title":"Classes","text":""},{"location":"omniread/core/#omniread.core.BaseParser","title":"BaseParser","text":"<pre><code>BaseParser(content: Content)\n</code></pre> <p> Bases: <code>ABC</code>, <code>Generic[T]</code></p> <p>Base interface for all parsers.</p> Notes <p>Guarantees:</p> <pre><code>- A parser is a self-contained object that owns the Content it is responsible for interpreting\n- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n</code></pre> <p>Responsibilities:</p> <pre><code>- Implementations must declare supported content types via `supported_types`\n- Implementations must 
raise parsing-specific exceptions from `parse()`\n- Implementations must remain deterministic for a given input\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/core/#omniread.core.BaseParser-attributes","title":"Attributes","text":""},{"location":"omniread/core/#omniread.core.BaseParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: Set[ContentType] = set()\n</code></pre> <p>Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.</p>"},{"location":"omniread/core/#omniread.core.BaseParser-functions","title":"Functions","text":""},{"location":"omniread/core/#omniread.core.BaseParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse the owned content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed, structured representation.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully consume the provided content and return a deterministic, structured output\n</code></pre>"},{"location":"omniread/core/#omniread.core.BaseParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False 
otherwise.</p>"},{"location":"omniread/core/#omniread.core.BaseScraper","title":"BaseScraper","text":"<p> Bases: <code>ABC</code></p> <p>Base interface for all scrapers.</p> Notes <p>Responsibilities:</p> <pre><code>- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n- Scrapers define how content is obtained, not what the content means\n- Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n</code></pre> <p>Constraints:</p> <pre><code>- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser\n</code></pre>"},{"location":"omniread/core/#omniread.core.BaseScraper-functions","title":"Functions","text":""},{"location":"omniread/core/#omniread.core.BaseScraper.fetch","title":"fetch <code>abstractmethod</code>","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch raw content from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>Location identifier (URL, file path, S3 URI, etc.)</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional hints for the scraper (headers, auth, etc.)</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>Content object containing raw bytes and metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object\n</code></pre>"},{"location":"omniread/core/#omniread.core.Content","title":"Content 
<code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers\n</code></pre>"},{"location":"omniread/core/#omniread.core.Content-attributes","title":"Attributes","text":""},{"location":"omniread/core/#omniread.core.Content.content_type","title":"content_type <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> <p>Optional MIME type of the content, if known.</p>"},{"location":"omniread/core/#omniread.core.Content.metadata","title":"metadata <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"omniread/core/#omniread.core.Content.raw","title":"raw <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"omniread/core/#omniread.core.Content.source","title":"source <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier of the content origin (URL, file path, or logical name).</p>"},{"location":"omniread/core/#omniread.core.ContentType","title":"ContentType","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the content source\n- It is primarily used 
for routing content to the appropriate parser or downstream consumer\n</code></pre>"},{"location":"omniread/core/#omniread.core.ContentType-attributes","title":"Attributes","text":""},{"location":"omniread/core/#omniread.core.ContentType.HTML","title":"HTML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"omniread/core/#omniread.core.ContentType.JSON","title":"JSON <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"omniread/core/#omniread.core.ContentType.PDF","title":"PDF <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"omniread/core/#omniread.core.ContentType.XML","title":"XML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"omniread/core/content/","title":"Content","text":""},{"location":"omniread/core/content/#omniread.core.content","title":"omniread.core.content","text":"<p>Canonical content models for OmniRead.</p>"},{"location":"omniread/core/content/#omniread.core.content--summary","title":"Summary","text":"<p>This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.</p> <p>The models defined here represent what was extracted, not how it was retrieved or parsed. 
Format-specific behavior and metadata must not alter the semantic meaning of these models.</p>"},{"location":"omniread/core/content/#omniread.core.content-classes","title":"Classes","text":""},{"location":"omniread/core/content/#omniread.core.content.Content","title":"Content <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers\n</code></pre>"},{"location":"omniread/core/content/#omniread.core.content.Content-attributes","title":"Attributes","text":""},{"location":"omniread/core/content/#omniread.core.content.Content.content_type","title":"content_type <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> <p>Optional MIME type of the content, if known.</p>"},{"location":"omniread/core/content/#omniread.core.content.Content.metadata","title":"metadata <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"omniread/core/content/#omniread.core.content.Content.raw","title":"raw <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"omniread/core/content/#omniread.core.content.Content.source","title":"source <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier of the content origin (URL, file 
path, or logical name).</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType","title":"ContentType","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n</code></pre>"},{"location":"omniread/core/content/#omniread.core.content.ContentType-attributes","title":"Attributes","text":""},{"location":"omniread/core/content/#omniread.core.content.ContentType.HTML","title":"HTML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType.JSON","title":"JSON <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType.PDF","title":"PDF <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType.XML","title":"XML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"omniread/core/parser/","title":"Parser","text":""},{"location":"omniread/core/parser/#omniread.core.parser","title":"omniread.core.parser","text":"<p>Abstract parsing contracts for OmniRead.</p>"},{"location":"omniread/core/parser/#omniread.core.parser--summary","title":"Summary","text":"<p>This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.</p> <p>Parsers are responsible 
for: - Interpreting a single <code>Content</code> instance - Validating compatibility with the content type - Producing a structured output suitable for downstream consumers</p> <p>Parsers are not responsible for: - Fetching or acquiring content - Performing retries or error recovery - Managing multiple content sources</p>"},{"location":"omniread/core/parser/#omniread.core.parser-classes","title":"Classes","text":""},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"<pre><code>BaseParser(content: Content)\n</code></pre> <p> Bases: <code>ABC</code>, <code>Generic[T]</code></p> <p>Base interface for all parsers.</p> Notes <p>Guarantees:</p> <pre><code>- A parser is a self-contained object that owns the Content it is responsible for interpreting\n- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n</code></pre> <p>Responsibilities:</p> <pre><code>- Implementations must declare supported content types via `supported_types`\n- Implementations must raise parsing-specific exceptions from `parse()`\n- Implementations must remain deterministic for a given input\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser-attributes","title":"Attributes","text":""},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: Set[ContentType] = set()\n</code></pre> <p>Set of content types supported by this parser. 
An empty set indicates that the parser is content-type agnostic.</p>"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser-functions","title":"Functions","text":""},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse the owned content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed, structured representation.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully consume the provided content and return a deterministic, structured output\n</code></pre>"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/core/scraper/","title":"Scraper","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":"<p>Abstract scraping contracts for OmniRead.</p>"},{"location":"omniread/core/scraper/#omniread.core.scraper--summary","title":"Summary","text":"<p>This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.</p> <p>Scrapers are responsible for: - Locating and retrieving raw content bytes - Attaching minimal contextual metadata - Returning normalized <code>Content</code> objects</p> <p>Scrapers are explicitly NOT responsible for: - Parsing or interpreting content - Inferring structure or semantics - Performing content-type specific processing</p> <p>All interpretation must be delegated to 
parsers.</p>"},{"location":"omniread/core/scraper/#omniread.core.scraper-classes","title":"Classes","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":"<p> Bases: <code>ABC</code></p> <p>Base interface for all scrapers.</p> Notes <p>Responsibilities:</p> <pre><code>- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n- Scrapers define how content is obtained, not what the content means\n- Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n</code></pre> <p>Constraints:</p> <pre><code>- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser\n</code></pre>"},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper-functions","title":"Functions","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch <code>abstractmethod</code>","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch raw content from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>Location identifier (URL, file path, S3 URI, etc.)</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional hints for the scraper (headers, auth, etc.)</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>Content object containing raw bytes and metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must retrieve the content referenced by `source` and return it as raw 
bytes wrapped in a `Content` object\n</code></pre>"},{"location":"omniread/html/","title":"Html","text":"<ul> <li>Parser</li> <li>Scraper</li> </ul>"},{"location":"omniread/html/#omniread.html","title":"omniread.html","text":"<p>HTML format implementation for OmniRead.</p>"},{"location":"omniread/html/#omniread.html--summary","title":"Summary","text":"<p>This package provides HTML-specific implementations of the core OmniRead contracts defined in <code>omniread.core</code>.</p> <p>It includes: - HTML parsers that interpret HTML content - HTML scrapers that retrieve HTML documents</p> <p>This package: - Implements, but does not redefine, core contracts - May contain HTML-specific behavior and edge-case handling - Produces canonical content models defined in <code>omniread.core.content</code></p> <p>Consumers should depend on <code>omniread.core</code> interfaces wherever possible and use this package only when HTML-specific behavior is required.</p>"},{"location":"omniread/html/#omniread.html--public-api","title":"Public API","text":"<pre><code>HTMLScraper\nHTMLParser\n</code></pre>"},{"location":"omniread/html/#omniread.html-classes","title":"Classes","text":""},{"location":"omniread/html/#omniread.html.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Provides reusable helpers for HTML extraction. 
Concrete parsers must explicitly define the return type\n</code></pre> <p>Guarantees:</p> <pre><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser-attributes","title":"Attributes","text":""},{"location":"omniread/html/#omniread.html.HTMLParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"omniread/html/#omniread.html.HTMLParser-functions","title":"Functions","text":""},{"location":"omniread/html/#omniread.html.HTMLParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n</code></pre>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_div","title":"parse_div <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> 
<p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_link","title":"parse_link <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document\n- This includes: Document title, `&lt;meta&gt;` tag name/property \u2192 content mappings\n</code></pre>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_table","title":"parse_table <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description 
<code>list[list[str]]</code> <p>list[list[str]]: A list of rows, where each row is a list of cell text values.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses\n</code></pre> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. 
If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"omniread/html/#omniread.html.HTMLScraper-functions","title":"Functions","text":""},{"location":"omniread/html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"omniread/html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML 
content.</p>"},{"location":"omniread/html/parser/","title":"Parser","text":""},{"location":"omniread/html/parser/#omniread.html.parser","title":"omniread.html.parser","text":"<p>HTML parser base implementations for OmniRead.</p>"},{"location":"omniread/html/parser/#omniread.html.parser--summary","title":"Summary","text":"<p>This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in <code>omniread.core.parser</code>.</p> <p>It supplies: - Content-type enforcement for HTML inputs - BeautifulSoup initialization and lifecycle management - Common helper methods for extracting structured data from HTML elements</p> <p>Concrete parsers must subclass <code>HTMLParser</code> and implement the <code>parse()</code> method to return a structured representation appropriate for their use case.</p>"},{"location":"omniread/html/parser/#omniread.html.parser-classes","title":"Classes","text":""},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Provides reusable helpers for HTML extraction. 
Concrete parsers must explicitly define the return type\n</code></pre> <p>Guarantees:</p> <pre><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser-attributes","title":"Attributes","text":""},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser-functions","title":"Functions","text":""},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n</code></pre>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div 
<code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document\n- This includes: Document title, `&lt;meta&gt;` tag name/property \u2192 content mappings\n</code></pre>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description 
Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>list[list[str]]: A list of rows, where each row is a list of cell text values.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/html/scraper/","title":"Scraper","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":"<p>HTML scraping implementation for OmniRead.</p>"},{"location":"omniread/html/scraper/#omniread.html.scraper--summary","title":"Summary","text":"<p>This module provides an HTTP-based scraper for retrieving HTML documents. 
It implements the core <code>BaseScraper</code> contract using <code>httpx</code> as the transport layer.</p> <p>This scraper is responsible for: - Fetching raw HTML bytes over HTTP(S) - Validating response content type - Attaching HTTP metadata to the returned content</p> <p>This scraper is not responsible for: - Parsing or interpreting HTML - Retrying failed requests - Managing crawl policies or rate limiting</p>"},{"location":"omniread/html/scraper/#omniread.html.scraper-classes","title":"Classes","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses\n</code></pre> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. 
If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper-functions","title":"Functions","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML 
content.</p>"},{"location":"omniread/pdf/","title":"Pdf","text":"<ul> <li>Client</li> <li>Parser</li> <li>Scraper</li> </ul>"},{"location":"omniread/pdf/#omniread.pdf","title":"omniread.pdf","text":"<p>PDF format implementation for OmniRead.</p>"},{"location":"omniread/pdf/#omniread.pdf--summary","title":"Summary","text":"<p>This package provides PDF-specific implementations of the core OmniRead contracts defined in <code>omniread.core</code>.</p> <p>Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes: - PDF clients for acquiring raw PDF data - PDF scrapers that coordinate client access - PDF parsers that extract structured content from PDF binaries</p> <p>Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.</p>"},{"location":"omniread/pdf/#omniread.pdf--public-api","title":"Public API","text":"<pre><code>FileSystemPDFClient\nPDFScraper\nPDFParser\n</code></pre>"},{"location":"omniread/pdf/#omniread.pdf-classes","title":"Classes","text":""},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p> Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns their raw binary contents\n</code></pre>"},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type 
Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFParser-attributes","title":"Attributes","text":""},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFParser-functions","title":"Functions","text":""},{"location":"omniread/pdf/#omniread.pdf.PDFParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- 
Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n</code></pre>"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper-functions","title":"Functions","text":""},{"location":"omniread/pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source 
identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"},{"location":"omniread/pdf/client/","title":"Client","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":"<p>PDF client abstractions for OmniRead.</p>"},{"location":"omniread/pdf/client/#omniread.pdf.client--summary","title":"Summary","text":"<p>This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.</p> <p>Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.</p> <p>Typical backing stores include: - Local filesystems - Object storage (S3, GCS, etc.) - Network file systems</p>"},{"location":"omniread/pdf/client/#omniread.pdf.client-classes","title":"Classes","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":"<p> Bases: <code>ABC</code></p> <p>Abstract client responsible for retrieving PDF bytes from a specific backing store (filesystem, S3, FTP, etc.).</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure\n</code></pre>"},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient-functions","title":"Functions","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch <code>abstractmethod</code>","text":"<pre><code>fetch(source: Any) -&gt; bytes\n</code></pre> <p>Fetch raw PDF bytes from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF location, such as a file path, object storage key, or remote reference.</p> 
required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors defined by the implementation.</p>"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p> Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns their raw binary contents\n</code></pre>"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"omniread/pdf/parser/","title":"Parser","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":"<p>PDF parser base implementations for OmniRead.</p>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser--summary","title":"Summary","text":"<p>This module defines the PDF-specific parser contract, extending the format-agnostic <code>BaseParser</code> with constraints appropriate for PDF content.</p> <p>PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream 
consumption.</p>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser-classes","title":"Classes","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser-attributes","title":"Attributes","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser-functions","title":"Functions","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes 
<p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n</code></pre>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/pdf/scraper/","title":"Scraper","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":"<p>PDF scraping implementation for OmniRead.</p>"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper--summary","title":"Summary","text":"<p>This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a <code>Content</code> object.</p> <p>The scraper implements the core <code>BaseScraper</code> contract while delegating all storage and access concerns to a <code>BasePDFClient</code> implementation.</p>"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper-classes","title":"Classes","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> 
required"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper-functions","title":"Functions","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"},{"location":"pdf/","title":"Pdf","text":""},{"location":"pdf/#omniread.pdf","title":"omniread.pdf","text":"<p>PDF format implementation for OmniRead.</p>"},{"location":"pdf/#omniread.pdf--summary","title":"Summary","text":"<p>This package provides PDF-specific implementations of the core OmniRead contracts defined in <code>omniread.core</code>.</p> <p>Unlike HTML, PDF handling requires an explicit client layer for document access. 
This package therefore includes: - PDF clients for acquiring raw PDF data - PDF scrapers that coordinate client access - PDF parsers that extract structured content from PDF binaries</p> <p>Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.</p>"},{"location":"pdf/#omniread.pdf--public-api","title":"Public API","text":"<pre><code>FileSystemPDFClient\nPDFScraper\nPDFParser\n</code></pre>"},{"location":"pdf/#omniread.pdf-classes","title":"Classes","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p> Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns their raw binary contents\n</code></pre>"},{"location":"pdf/#omniread.pdf.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n</code></pre> <p>Constraints:</p> 
<pre><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"pdf/#omniread.pdf.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"pdf/#omniread.pdf.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n</code></pre>"},{"location":"pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Scraper 
for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not perform parsing or interpretation and does not assume a specific storage backend\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"pdf/#omniread.pdf.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"},{"location":"pdf/client/","title":"Client","text":""},{"location":"pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":"<p>PDF client abstractions for OmniRead.</p>"},{"location":"pdf/client/#omniread.pdf.client--summary","title":"Summary","text":"<p>This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.</p> <p>Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. 
They do not perform validation, interpretation, or content extraction.</p> <p>Typical backing stores include: - Local filesystems - Object storage (S3, GCS, etc.) - Network file systems</p>"},{"location":"pdf/client/#omniread.pdf.client-classes","title":"Classes","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":"<p> Bases: <code>ABC</code></p> <p>Abstract client responsible for retrieving PDF bytes from a specific backing store (filesystem, S3, FTP, etc.).</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure\n</code></pre>"},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch <code>abstractmethod</code>","text":"<pre><code>fetch(source: Any) -&gt; bytes\n</code></pre> <p>Fetch raw PDF bytes from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF location, such as a file path, object storage key, or remote reference.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors defined by the implementation.</p>"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p> Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns their raw binary 
contents\n</code></pre>"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"pdf/parser/","title":"Parser","text":""},{"location":"pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":"<p>PDF parser base implementations for OmniRead.</p>"},{"location":"pdf/parser/#omniread.pdf.parser--summary","title":"Summary","text":"<p>This module defines the PDF-specific parser contract, extending the format-agnostic <code>BaseParser</code> with constraints appropriate for PDF content.</p> <p>PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.</p>"},{"location":"pdf/parser/#omniread.pdf.parser-classes","title":"Classes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p> Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete implementations must define the output type `T` and implement the `parse()` method\n</code></pre> <p>Initialize the parser with content 
to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n</code></pre>"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"pdf/scraper/","title":"Scraper","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":"<p>PDF scraping implementation for 
OmniRead.</p>"},{"location":"pdf/scraper/#omniread.pdf.scraper--summary","title":"Summary","text":"<p>This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a <code>Content</code> object.</p> <p>The scraper implements the core <code>BaseScraper</code> contract while delegating all storage and access concerns to a <code>BasePDFClient</code> implementation.</p>"},{"location":"pdf/scraper/#omniread.pdf.scraper-classes","title":"Classes","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p> Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not perform parsing or interpretation and does not assume a specific storage backend\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> 
<code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"}]}