docs/libs/omniread/site/search/search_index.json
							{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"omniread/","title":"Omniread","text":""},{"location":"omniread/#omniread","title":"omniread","text":"<p>OmniRead \u2014 format-agnostic content acquisition and parsing framework.</p> <p>OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.</p> <p>The library is structured around three core concepts:</p> <ol> <li> <p>Content    A canonical, format-agnostic container representing raw content bytes    and minimal contextual metadata.</p> </li> <li> <p>Scrapers    Components responsible for acquiring raw content from a source    (HTTP, filesystem, object storage, etc.). Scrapers never interpret    content.</p> </li> <li> <p>Parsers    Components responsible for interpreting acquired content and    converting it into structured, typed representations.</p> </li> </ol> <p>OmniRead deliberately separates these responsibilities to ensure: - Clear boundaries between IO and interpretation - Replaceable implementations per format - Predictable, testable behavior</p>"},{"location":"omniread/#omniread--installation","title":"Installation","text":"<p>Install OmniRead using pip:</p> <pre><code>pip install omniread\n</code></pre> <p>Or with Poetry:</p> <pre><code>poetry add omniread\n</code></pre>"},{"location":"omniread/#omniread--basic-usage","title":"Basic Usage","text":"<p>HTML example:</p> <pre><code>from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -&gt; str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n</code></pre> <p>PDF example:</p> <pre><code>from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = 
PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -&gt; str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n</code></pre>"},{"location":"omniread/#omniread--public-api-surface","title":"Public API Surface","text":"<p>This module re-exports the recommended public entry points of OmniRead.</p> <p>Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.</p> <p>Core: - Content - ContentType</p> <p>HTML: - HTMLScraper - HTMLParser</p> <p>PDF: - FileSystemPDFClient - PDFScraper - PDFParser</p>"},{"location":"omniread/#omniread--core-philosophy","title":"Core Philosophy","text":"<p><code>OmniRead</code> is designed as a decoupled content engine:</p> <ol> <li>Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.</li> <li>Normalized Exchange: All components communicate via the <code>Content</code> model, ensuring a consistent contract.</li> <li>Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.</li> </ol>"},{"location":"omniread/#omniread--documentation-design","title":"Documentation Design","text":"<p>For those extending <code>OmniRead</code>, follow these \"AI-Native\" docstring principles:</p>"},{"location":"omniread/#omniread--for-humans","title":"For Humans","text":"<ul> <li>Clear Contracts: Explicitly state what a component is and is NOT responsible for.</li> <li>Runnable Examples: Include small, logical snippets in the package <code>__init__.py</code>.</li> </ul>"},{"location":"omniread/#omniread--for-llms","title":"For LLMs","text":"<ul> <li>Structured Models: Use dataclasses and enums for core data to ensure clean MCP JSON representation.</li> <li>Type Safety: All public APIs must be fully typed and have corresponding <code>.pyi</code> 
stubs.</li> <li>Detailed Raises: Include <code>: description</code> pairs in the <code>Raises</code> section to help agents handle errors gracefully.</li> </ul>"},{"location":"omniread/#omniread.Content","title":"Content  <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> <p>A <code>Content</code> instance represents a raw content payload along with minimal contextual metadata describing its origin and type.</p> <p>This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers</p> <p>Attributes:</p> Name Type Description <code>raw</code> <code>bytes</code> <p>Raw content bytes as retrieved from the source.</p> <code>source</code> <code>str</code> <p>Identifier of the content origin (URL, file path, or logical name).</p> <code>content_type</code> <code>Optional[ContentType]</code> <p>Optional MIME type of the content, if known.</p> <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"omniread/#omniread.ContentType","title":"ContentType","text":"<p>               Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> <p>This enum represents the declared or inferred media type of the content source. 
It is primarily used for routing content to the appropriate parser or downstream consumer.</p>"},{"location":"omniread/#omniread.ContentType.HTML","title":"HTML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"omniread/#omniread.ContentType.JSON","title":"JSON  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"omniread/#omniread.ContentType.PDF","title":"PDF  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"omniread/#omniread.ContentType.XML","title":"XML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"omniread/#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p>               Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> <p>This client reads PDF files directly from the disk and returns their raw binary contents.</p>"},{"location":"omniread/#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Type Description <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"omniread/#omniread.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p>              
 Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> <p>This class extends the core <code>BaseParser</code> with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.</p> <p>Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.</p> <p>Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures</p> <p>Concrete subclasses must: - Define the output type <code>T</code> - Implement the <code>parse()</code> method</p> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"omniread/#omniread.HTMLParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"omniread/#omniread.HTMLParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Implementations must fully interpret the HTML DOM and return a deterministic, structured output.</p> <p>Returns:</p> Type Description <code>T</code> <p>Parsed representation of type <code>T</code>.</p>"},{"location":"omniread/#omniread.HTMLParser.parse_div","title":"parse_div  <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> 
element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Type Description <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"omniread/#omniread.HTMLParser.parse_link","title":"parse_link  <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"omniread/#omniread.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>This includes: - Document title - <code>&lt;meta&gt;</code> tag name/property \u2192 content mappings</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>Dictionary containing extracted metadata.</p>"},{"location":"omniread/#omniread.HTMLParser.parse_table","title":"parse_table  <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>A list of rows, where each row is a list of cell text values.</p>"},{"location":"omniread/#omniread.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check 
whether this parser supports the content's type.</p> <p>Returns:</p> Type Description <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/#omniread.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> <p>This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a <code>Content</code> object.</p> <p>Fetches raw bytes and metadata only. The scraper: - Uses <code>httpx.Client</code> for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata</p> <p>The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses</p> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Optional[Client]</code> <p>Optional pre-configured <code>httpx.Client</code>. 
If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"omniread/#omniread.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Type Description <code>Content</code> <p>A <code>Content</code> instance containing:</p> <code>Content</code> <ul> <li>Raw HTML bytes</li> </ul> <code>Content</code> <ul> <li>Source URL</li> </ul> <code>Content</code> <ul> <li>HTML content type</li> </ul> <code>Content</code> <ul> <li>HTTP response metadata</li> </ul> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"omniread/#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML 
content.</p>"},{"location":"omniread/#omniread.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> <p>This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.</p> <p>Concrete implementations must: - Define the output type <code>T</code> - Implement the <code>parse()</code> method</p> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/#omniread.PDFParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"omniread/#omniread.PDFParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.</p> <p>Returns:</p> Type Description <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p>"},{"location":"omniread/#omniread.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Type Description <code>bool</code> <p>True if the content type is supported; False 
otherwise.</p>"},{"location":"omniread/#omniread.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> <p>Delegates byte retrieval to a PDF client and normalizes output into Content.</p> <p>The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata</p> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"omniread/#omniread.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Type Description <code>Content</code> <p>A <code>Content</code> instance containing:</p> <code>Content</code> <ul> <li>Raw PDF bytes</li> </ul> <code>Content</code> <ul> <li>Source identifier</li> </ul> <code>Content</code> <ul> <li>PDF content type</li> </ul> <code>Content</code> <ul> <li>Optional metadata</li> </ul> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"},{"location":"omniread/core/","title":"Core","text":""},{"location":"omniread/core/#omniread.core","title":"omniread.core","text":"<p>Core domain contracts for OmniRead.</p> <p>This package defines the format-agnostic domain layer of OmniRead. 
It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).</p> <p>Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.</p> <p>Submodules: - content: Canonical content models and enums - parser: Abstract parsing contracts - scraper: Abstract scraping contracts</p> <p>Format-specific behavior must not be introduced at this layer.</p>"},{"location":"omniread/core/#omniread.core.BaseParser","title":"BaseParser","text":"<pre><code>BaseParser(content: Content)\n</code></pre> <p>               Bases: <code>ABC</code>, <code>Generic[T]</code></p> <p>Base interface for all parsers.</p> <p>A parser is a self-contained object that owns the Content it is responsible for interpreting.</p> <p>Implementations must: - Declare supported content types via <code>supported_types</code> - Raise parsing-specific exceptions from <code>parse()</code> - Remain deterministic for a given input</p> <p>Consumers may rely on: - Early validation of content compatibility - Type-stable return values from <code>parse()</code></p> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/core/#omniread.core.BaseParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: Set[ContentType] = set()\n</code></pre> <p>Set of content types supported by this parser.</p> <p>An empty set indicates that the parser is content-type agnostic.</p>"},{"location":"omniread/core/#omniread.core.BaseParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse the owned content 
into structured output.</p> <p>Implementations must fully consume the provided content and return a deterministic, structured output.</p> <p>Returns:</p> Type Description <code>T</code> <p>Parsed, structured representation.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p>"},{"location":"omniread/core/#omniread.core.BaseParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Type Description <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/core/#omniread.core.BaseScraper","title":"BaseScraper","text":"<p>               Bases: <code>ABC</code></p> <p>Base interface for all scrapers.</p> <p>A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it.</p> <p>A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a <code>Content</code> object.</p> <p>Scrapers define how content is obtained, not what the content means.</p> <p>Implementations may vary in: - Transport mechanism (HTTP, filesystem, cloud storage) - Authentication strategy - Retry and backoff behavior</p> <p>Implementations must not: - Parse content - Modify content semantics - Couple scraping logic to a specific parser</p>"},{"location":"omniread/core/#omniread.core.BaseScraper.fetch","title":"fetch  <code>abstractmethod</code>","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch raw content from the given source.</p> <p>Implementations must retrieve the content referenced by <code>source</code> and return it as raw bytes wrapped in a <code>Content</code> object.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>Location identifier (URL, file path, S3 
URI, etc.)</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional hints for the scraper (headers, auth, etc.)</p> <code>None</code> <p>Returns:</p> Type Description <code>Content</code> <p>Content object containing raw bytes and metadata.</p> <code>Content</code> <ul> <li>Raw content bytes</li> </ul> <code>Content</code> <ul> <li>Source identifier</li> </ul> <code>Content</code> <ul> <li>Optional metadata</li> </ul> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors as defined by the implementation.</p>"},{"location":"omniread/core/#omniread.core.Content","title":"Content  <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> <p>A <code>Content</code> instance represents a raw content payload along with minimal contextual metadata describing its origin and type.</p> <p>This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers</p> <p>Attributes:</p> Name Type Description <code>raw</code> <code>bytes</code> <p>Raw content bytes as retrieved from the source.</p> <code>source</code> <code>str</code> <p>Identifier of the content origin (URL, file path, or logical name).</p> <code>content_type</code> <code>Optional[ContentType]</code> <p>Optional MIME type of the content, if known.</p> <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"omniread/core/#omniread.core.ContentType","title":"ContentType","text":"<p>               Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> <p>This enum represents the declared or inferred media type of the content source. 
It is primarily used for routing content to the appropriate parser or downstream consumer.</p>"},{"location":"omniread/core/#omniread.core.ContentType.HTML","title":"HTML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"omniread/core/#omniread.core.ContentType.JSON","title":"JSON  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"omniread/core/#omniread.core.ContentType.PDF","title":"PDF  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"omniread/core/#omniread.core.ContentType.XML","title":"XML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"omniread/core/content/","title":"Content","text":""},{"location":"omniread/core/content/#omniread.core.content","title":"omniread.core.content","text":"<p>Canonical content models for OmniRead.</p> <p>This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.</p> <p>The models defined here represent what was extracted, not how it was retrieved or parsed. 
Format-specific behavior and metadata must not alter the semantic meaning of these models.</p>"},{"location":"omniread/core/content/#omniread.core.content.Content","title":"Content  <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> <p>A <code>Content</code> instance represents a raw content payload along with minimal contextual metadata describing its origin and type.</p> <p>This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers</p> <p>Attributes:</p> Name Type Description <code>raw</code> <code>bytes</code> <p>Raw content bytes as retrieved from the source.</p> <code>source</code> <code>str</code> <p>Identifier of the content origin (URL, file path, or logical name).</p> <code>content_type</code> <code>Optional[ContentType]</code> <p>Optional MIME type of the content, if known.</p> <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType","title":"ContentType","text":"<p>               Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> <p>This enum represents the declared or inferred media type of the content source. 
It is primarily used for routing content to the appropriate parser or downstream consumer.</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType.HTML","title":"HTML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType.JSON","title":"JSON  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType.PDF","title":"PDF  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"omniread/core/content/#omniread.core.content.ContentType.XML","title":"XML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"omniread/core/parser/","title":"Parser","text":""},{"location":"omniread/core/parser/#omniread.core.parser","title":"omniread.core.parser","text":"<p>Abstract parsing contracts for OmniRead.</p> <p>This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.</p> <p>Parsers are responsible for: - Interpreting a single <code>Content</code> instance - Validating compatibility with the content type - Producing a structured output suitable for downstream consumers</p> <p>Parsers are not responsible for: - Fetching or acquiring content - Performing retries or error recovery - Managing multiple content sources</p>"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"<pre><code>BaseParser(content: Content)\n</code></pre> <p>               Bases: <code>ABC</code>, <code>Generic[T]</code></p> <p>Base interface 
for all parsers.</p> <p>A parser is a self-contained object that owns the Content it is responsible for interpreting.</p> <p>Implementations must: - Declare supported content types via <code>supported_types</code> - Raise parsing-specific exceptions from <code>parse()</code> - Remain deterministic for a given input</p> <p>Consumers may rely on: - Early validation of content compatibility - Type-stable return values from <code>parse()</code></p> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: Set[ContentType] = set()\n</code></pre> <p>Set of content types supported by this parser.</p> <p>An empty set indicates that the parser is content-type agnostic.</p>"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse the owned content into structured output.</p> <p>Implementations must fully consume the provided content and return a deterministic, structured output.</p> <p>Returns:</p> Type Description <code>T</code> <p>Parsed, structured representation.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p>"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Type Description <code>bool</code> <p>True if the content type is supported; False 
otherwise.</p>"},{"location":"omniread/core/scraper/","title":"Scraper","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":"<p>Abstract scraping contracts for OmniRead.</p> <p>This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.</p> <p>Scrapers are responsible for: - Locating and retrieving raw content bytes - Attaching minimal contextual metadata - Returning normalized <code>Content</code> objects</p> <p>Scrapers are explicitly NOT responsible for: - Parsing or interpreting content - Inferring structure or semantics - Performing content-type specific processing</p> <p>All interpretation must be delegated to parsers.</p>"},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":"<p>               Bases: <code>ABC</code></p> <p>Base interface for all scrapers.</p> <p>A scraper is responsible ONLY for fetching raw content (bytes) from a source. 
It must not interpret or parse it.</p> <p>A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a <code>Content</code> object.</p> <p>Scrapers define how content is obtained, not what the content means.</p> <p>Implementations may vary in: - Transport mechanism (HTTP, filesystem, cloud storage) - Authentication strategy - Retry and backoff behavior</p> <p>Implementations must not: - Parse content - Modify content semantics - Couple scraping logic to a specific parser</p>"},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch  <code>abstractmethod</code>","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch raw content from the given source.</p> <p>Implementations must retrieve the content referenced by <code>source</code> and return it as raw bytes wrapped in a <code>Content</code> object.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>Location identifier (URL, file path, S3 URI, etc.)</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional hints for the scraper (headers, auth, etc.)</p> <code>None</code> <p>Returns:</p> Type Description <code>Content</code> <p>Content object containing raw bytes and metadata.</p> <code>Content</code> <ul> <li>Raw content bytes</li> </ul> <code>Content</code> <ul> <li>Source identifier</li> </ul> <code>Content</code> <ul> <li>Optional metadata</li> </ul> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors as defined by the implementation.</p>"},{"location":"omniread/html/","title":"Html","text":""},{"location":"omniread/html/#omniread.html","title":"omniread.html","text":"<p>HTML format implementation for OmniRead.</p> <p>This package provides HTML-specific implementations of the core OmniRead contracts defined in <code>omniread.core</code>.</p> <p>It includes: - 
HTML parsers that interpret HTML content - HTML scrapers that retrieve HTML documents</p> <p>This package: - Implements, but does not redefine, core contracts - May contain HTML-specific behavior and edge-case handling - Produces canonical content models defined in <code>omniread.core.content</code></p> <p>Consumers should depend on <code>omniread.core</code> interfaces wherever possible and use this package only when HTML-specific behavior is required.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> <p>This class extends the core <code>BaseParser</code> with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.</p> <p>Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.</p> <p>Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures</p> <p>Concrete subclasses must: - Define the output type <code>T</code> - Implement the <code>parse()</code> method</p> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML 
only).</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Implementations must fully interpret the HTML DOM and return a deterministic, structured output.</p> <p>Returns:</p> Type Description <code>T</code> <p>Parsed representation of type <code>T</code>.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_div","title":"parse_div  <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Type Description <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_link","title":"parse_link  <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>This includes: - Document title - <code>&lt;meta&gt;</code> tag name/property \u2192 content mappings</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> 
<p>Dictionary containing extracted metadata.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_table","title":"parse_table  <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>A list of rows, where each row is a list of cell text values.</p>"},{"location":"omniread/html/#omniread.html.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Type Description <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> <p>This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a <code>Content</code> object.</p> <p>Fetches raw bytes and metadata only. The scraper: - Uses <code>httpx.Client</code> for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata</p> <p>The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses</p> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Optional[Client]</code> <p>Optional pre-configured <code>httpx.Client</code>. 
If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"omniread/html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Type Description <code>Content</code> <p>A <code>Content</code> instance containing:</p> <code>Content</code> <ul> <li>Raw HTML bytes</li> </ul> <code>Content</code> <ul> <li>Source URL</li> </ul> <code>Content</code> <ul> <li>HTML content type</li> </ul> <code>Content</code> <ul> <li>HTTP response metadata</li> </ul> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"omniread/html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML 
content.</p>"},{"location":"omniread/html/parser/","title":"Parser","text":""},{"location":"omniread/html/parser/#omniread.html.parser","title":"omniread.html.parser","text":"<p>HTML parser base implementations for OmniRead.</p> <p>This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in <code>omniread.core.parser</code>.</p> <p>It supplies: - Content-type enforcement for HTML inputs - BeautifulSoup initialization and lifecycle management - Common helper methods for extracting structured data from HTML elements</p> <p>Concrete parsers must subclass <code>HTMLParser</code> and implement the <code>parse()</code> method to return a structured representation appropriate for their use case.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> <p>This class extends the core <code>BaseParser</code> with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.</p> <p>Provides reusable helpers for HTML extraction. 
Concrete parsers must explicitly define the return type.</p> <p>Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures</p> <p>Concrete subclasses must: - Define the output type <code>T</code> - Implement the <code>parse()</code> method</p> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Implementations must fully interpret the HTML DOM and return a deterministic, structured output.</p> <p>Returns:</p> Type Description <code>T</code> <p>Parsed representation of type <code>T</code>.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div  <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used 
to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Type Description <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link  <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>This includes: - Document title - <code>&lt;meta&gt;</code> tag name/property \u2192 content mappings</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>Dictionary containing extracted metadata.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table  <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>A list of rows, where each row is a list of cell text values.</p>"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Type Description <code>bool</code> <p>True if the content 
type is supported; False otherwise.</p>"},{"location":"omniread/html/scraper/","title":"Scraper","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":"<p>HTML scraping implementation for OmniRead.</p> <p>This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core <code>BaseScraper</code> contract using <code>httpx</code> as the transport layer.</p> <p>This scraper is responsible for: - Fetching raw HTML bytes over HTTP(S) - Validating response content type - Attaching HTTP metadata to the returned content</p> <p>This scraper is not responsible for: - Parsing or interpreting HTML - Retrying failed requests - Managing crawl policies or rate limiting</p>"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using httpx.</p> <p>This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a <code>Content</code> object.</p> <p>Fetches raw bytes and metadata only. The scraper: - Uses <code>httpx.Client</code> for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata</p> <p>The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses</p> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Optional[Client]</code> <p>Optional pre-configured <code>httpx.Client</code>. 
If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Type Description <code>Content</code> <p>A <code>Content</code> instance containing:</p> <code>Content</code> <ul> <li>Raw HTML bytes</li> </ul> <code>Content</code> <ul> <li>Source URL</li> </ul> <code>Content</code> <ul> <li>HTML content type</li> </ul> <code>Content</code> <ul> <li>HTTP response metadata</li> </ul> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML 
content.</p>"},{"location":"omniread/pdf/","title":"Pdf","text":""},{"location":"omniread/pdf/#omniread.pdf","title":"omniread.pdf","text":"<p>PDF format implementation for OmniRead.</p> <p>This package provides PDF-specific implementations of the core OmniRead contracts defined in <code>omniread.core</code>.</p> <p>Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes: - PDF clients for acquiring raw PDF data - PDF scrapers that coordinate client access - PDF parsers that extract structured content from PDF binaries</p> <p>Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.</p>"},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p>               Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> <p>This client reads PDF files directly from the disk and returns their raw binary contents.</p>"},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Type Description <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> <p>This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.</p> 
<p>Concrete implementations must: - Define the output type <code>T</code> - Implement the <code>parse()</code> method</p> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.</p> <p>Returns:</p> Type Description <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Type Description <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> <p>Delegates byte retrieval to a PDF client and normalizes output into Content.</p> <p>The scraper: - Does not perform parsing or interpretation - Does not assume a specific 
storage backend - Preserves caller-provided metadata</p> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Type Description <code>Content</code> <p>A <code>Content</code> instance containing:</p> <code>Content</code> <ul> <li>Raw PDF bytes</li> </ul> <code>Content</code> <ul> <li>Source identifier</li> </ul> <code>Content</code> <ul> <li>PDF content type</li> </ul> <code>Content</code> <ul> <li>Optional metadata</li> </ul> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"},{"location":"omniread/pdf/client/","title":"Client","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":"<p>PDF client abstractions for OmniRead.</p> <p>This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.</p> <p>Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.</p> <p>Typical backing stores include: - Local filesystems - Object storage (S3, GCS, etc.) 
- Network file systems</p>"},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":"<p>               Bases: <code>ABC</code></p> <p>Abstract client responsible for retrieving PDF bytes from a specific backing store (filesystem, S3, FTP, etc.).</p> <p>Implementations must: - Accept a source identifier appropriate to the backing store - Return the full PDF binary payload - Raise retrieval-specific errors on failure</p>"},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch  <code>abstractmethod</code>","text":"<pre><code>fetch(source: Any) -&gt; bytes\n</code></pre> <p>Fetch raw PDF bytes from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF location, such as a file path, object storage key, or remote reference.</p> required <p>Returns:</p> Type Description <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors defined by the implementation.</p>"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p>               Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> <p>This client reads PDF files directly from the disk and returns their raw binary contents.</p>"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Type Description <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a 
file.</p>"},{"location":"omniread/pdf/parser/","title":"Parser","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":"<p>PDF parser base implementations for OmniRead.</p> <p>This module defines the PDF-specific parser contract, extending the format-agnostic <code>BaseParser</code> with constraints appropriate for PDF content.</p> <p>PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.</p>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> <p>This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.</p> <p>Concrete implementations must: - Define the output type <code>T</code> - Implement the <code>parse()</code> method</p> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Implementations must fully interpret the PDF binary payload and return a deterministic, structured 
output.</p> <p>Returns:</p> Type Description <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p>"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Type Description <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"omniread/pdf/scraper/","title":"Scraper","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":"<p>PDF scraping implementation for OmniRead.</p> <p>This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a <code>Content</code> object.</p> <p>The scraper implements the core <code>BaseScraper</code> contract while delegating all storage and access concerns to a <code>BasePDFClient</code> implementation.</p>"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> <p>Delegates byte retrieval to a PDF client and normalizes output into Content.</p> <p>The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata</p> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch 
a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Type Description <code>Content</code> <p>A <code>Content</code> instance containing:</p> <code>Content</code> <ul> <li>Raw PDF bytes</li> </ul> <code>Content</code> <ul> <li>Source identifier</li> </ul> <code>Content</code> <ul> <li>PDF content type</li> </ul> <code>Content</code> <ul> <li>Optional metadata</li> </ul> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"}]}
						
						
					
				
				
					