docs/libs/omniread/site/search/search_index.json
{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"omniread","text":""},{"location":"#omniread","title":"omniread","text":""},{"location":"#omniread--summary","title":"Summary","text":"<p><code>OmniRead</code> \u2014 format-agnostic content acquisition and parsing framework.</p> <p><code>OmniRead</code> provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.</p> <p>The library is structured around three core concepts:</p> <ol> <li><code>Content</code>: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.</li> <li><code>Scrapers</code>: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). <code>Scrapers</code> never interpret content.</li> <li><code>Parsers</code>: Components responsible for interpreting acquired content and converting it into structured, typed representations.</li> </ol> <p><code>OmniRead</code> deliberately separates these responsibilities to ensure:</p> <ul> <li>Clear boundaries between IO and interpretation.</li> <li>Replaceable implementations per format.</li> <li>Predictable, testable behavior.</li> </ul>"},{"location":"#omniread--installation","title":"Installation","text":"<p>Install <code>OmniRead</code> using pip:</p> <pre><code>pip install omniread\n</code></pre> <p>Install <code>OmniRead</code> using Poetry:</p> <pre><code>poetry add omniread\n</code></pre>"},{"location":"#omniread--quick-start","title":"Quick start","text":"Example <p>HTML example:</p> <pre><code>from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -&gt; str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n</code></pre> <p>PDF 
example:</p> <pre><code>from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -&gt; str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n</code></pre>"},{"location":"#omniread--public-api","title":"Public API","text":"<p>This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.</p> <ul> <li><code>Content</code>: Canonical content model.</li> <li><code>ContentType</code>: Supported media types.</li> <li><code>HTMLScraper</code>: HTTP-based HTML acquisition.</li> <li><code>HTMLParser</code>: Base parser for HTML DOM interpretation.</li> <li><code>FileSystemPDFClient</code>: Local filesystem PDF access.</li> <li><code>PDFScraper</code>: PDF-specific content acquisition.</li> <li><code>PDFParser</code>: Base parser for PDF binary interpretation.</li> </ul>"},{"location":"#omniread--core-philosophy","title":"Core Philosophy","text":"<p><code>OmniRead</code> is designed as a decoupled content engine:</p> <ol> <li>Separation of Concerns: Scrapers fetch, Parsers interpret. 
Neither    knows about the other.</li> <li>Normalized Exchange: All components communicate via the <code>Content</code> model,    ensuring a consistent contract.</li> <li>Format Agnosticism: The core logic is independent of whether the input    is HTML, PDF, or JSON.</li> </ol>"},{"location":"#omniread-classes","title":"Classes","text":""},{"location":"#omniread.Content","title":"Content  <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw content payload along with\n  minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n  parsers, and downstream consumers.\n</code></pre>"},{"location":"#omniread.Content-attributes","title":"Attributes","text":""},{"location":"#omniread.Content.content_type","title":"content_type  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> <p>Optional MIME type of the content, if known.</p>"},{"location":"#omniread.Content.metadata","title":"metadata  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"#omniread.Content.raw","title":"raw  <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"#omniread.Content.source","title":"source  <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier of the content origin (URL, file path, or logical 
name).</p>"},{"location":"#omniread.ContentType","title":"ContentType","text":"<p>               Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the\n  content source.\n- It is primarily used for routing content to the appropriate\n  parser or downstream consumer.\n</code></pre>"},{"location":"#omniread.ContentType-attributes","title":"Attributes","text":""},{"location":"#omniread.ContentType.HTML","title":"HTML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"#omniread.ContentType.JSON","title":"JSON  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"#omniread.ContentType.PDF","title":"PDF  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"#omniread.ContentType.XML","title":"XML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p>               Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns\n  their raw binary contents.\n</code></pre>"},{"location":"#omniread.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default 
<code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"#omniread.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior,\n  including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Provides reusable helpers for HTML extraction. Concrete parsers must\n  explicitly define the return type.\n</code></pre> <p>Guarantees:</p> <pre><code>- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement\n  the `parse()` method.\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"#omniread.HTMLParser-attributes","title":"Attributes","text":""},{"location":"#omniread.HTMLParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser 
(HTML only).</p>"},{"location":"#omniread.HTMLParser-functions","title":"Functions","text":""},{"location":"#omniread.HTMLParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a\n  deterministic, structured output.\n</code></pre>"},{"location":"#omniread.HTMLParser.parse_div","title":"parse_div  <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"#omniread.HTMLParser.parse_link","title":"parse_link  <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"#omniread.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, 
Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document.\n- This includes: Document title, `&lt;meta&gt;` tag name/property to\n  content mappings.\n</code></pre>"},{"location":"#omniread.HTMLParser.parse_table","title":"parse_table  <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>list[list[str]]: A list of rows, where each row is a list of cell text values.</p>"},{"location":"#omniread.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"#omniread.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using <code>httpx</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns\n  them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n  HTML content type, and preserves HTTP response metadata.\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff,\n  handle non-HTML responses.\n</code></pre> <p>Initialize the HTML 
scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"#omniread.HTMLScraper-functions","title":"Functions","text":""},{"location":"#omniread.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing 
or does not indicate HTML content.</p>"},{"location":"#omniread.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and provides\n  the extension point for implementing concrete PDF parsing strategies.\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete implementations must define the output type `T` and\n  implement the `parse()` method.\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"#omniread.PDFParser-attributes","title":"Attributes","text":""},{"location":"#omniread.PDFParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"#omniread.PDFParser-functions","title":"Functions","text":""},{"location":"#omniread.PDFParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the PDF binary payload and\n  return a deterministic, structured 
output.\n</code></pre>"},{"location":"#omniread.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"#omniread.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output\n  into `Content`.\n- Preserves caller-provided metadata.\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"#omniread.PDFScraper-functions","title":"Functions","text":""},{"location":"#omniread.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors 
raised by the PDF client.</p>"},{"location":"core/","title":"Core","text":""},{"location":"core/#omniread.core","title":"omniread.core","text":""},{"location":"core/#omniread.core--summary","title":"Summary","text":"<p>Core domain contracts for OmniRead.</p> <p>This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).</p> <p>Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.</p> <p>Submodules:</p> <ul> <li><code>content</code>: Canonical content models and enums.</li> <li><code>parser</code>: Abstract parsing contracts.</li> <li><code>scraper</code>: Abstract scraping contracts.</li> </ul> <p>Format-specific behavior must not be introduced at this layer.</p>"},{"location":"core/#omniread.core--public-api","title":"Public API","text":"<ul> <li><code>Content</code></li> <li><code>ContentType</code></li> </ul>"},{"location":"core/#omniread.core-classes","title":"Classes","text":""},{"location":"core/#omniread.core.BaseParser","title":"BaseParser","text":"<pre><code>BaseParser(content: Content)\n</code></pre> <p>               Bases: <code>ABC</code>, <code>Generic[T]</code></p> <p>Base interface for all parsers.</p> Notes <p>Guarantees:</p> <pre><code>- A parser is a self-contained object that owns the `Content` it is\n  responsible for interpreting.\n- Consumers may rely on early validation of content compatibility\n  and type-stable return values from `parse()`.\n</code></pre> <p>Responsibilities:</p> <pre><code>- Implementations must declare supported content types via `supported_types`.\n- Implementations must raise parsing-specific exceptions from `parse()`.\n- Implementations must remain deterministic for a given input.\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> 
<code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"core/#omniread.core.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.BaseParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: Set[ContentType] = set()\n</code></pre> <p>Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.</p>"},{"location":"core/#omniread.core.BaseParser-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse the owned content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed, structured representation.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully consume the provided content and\n  return a deterministic, structured output.\n</code></pre>"},{"location":"core/#omniread.core.BaseParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"core/#omniread.core.BaseScraper","title":"BaseScraper","text":"<p>               Bases: <code>ABC</code></p> <p>Base interface for all scrapers.</p> Notes <p>Responsibilities:</p> <pre><code>- A scraper is responsible ONLY for fetching raw content (bytes)\n  from a source. 
It must not interpret or parse it.\n- A scraper is a stateless acquisition component that retrieves raw\n  content from a source and returns it as a `Content` object.\n- Scrapers define how content is obtained, not what the content means.\n- Implementations may vary in transport mechanism, authentication\n  strategy, retry and backoff behavior.\n</code></pre> <p>Constraints:</p> <pre><code>- Implementations must not parse content, modify content semantics,\n  or couple scraping logic to a specific parser.\n</code></pre>"},{"location":"core/#omniread.core.BaseScraper-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseScraper.fetch","title":"fetch  <code>abstractmethod</code>","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch raw content from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>Location identifier (URL, file path, S3 URI, etc.).</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional hints for the scraper (headers, auth, etc.).</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>Content object containing raw bytes and metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must retrieve the content referenced by `source`\n  and return it as raw bytes wrapped in a `Content` object.\n</code></pre>"},{"location":"core/#omniread.core.Content","title":"Content  <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw 
content payload along with\n  minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n  parsers, and downstream consumers.\n</code></pre>"},{"location":"core/#omniread.core.Content-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.Content.content_type","title":"content_type  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> <p>Optional MIME type of the content, if known.</p>"},{"location":"core/#omniread.core.Content.metadata","title":"metadata  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"core/#omniread.core.Content.raw","title":"raw  <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"core/#omniread.core.Content.source","title":"source  <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier of the content origin (URL, file path, or logical name).</p>"},{"location":"core/#omniread.core.ContentType","title":"ContentType","text":"<p>               Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the\n  content source.\n- It is primarily used for routing content to the appropriate\n  parser or downstream consumer.\n</code></pre>"},{"location":"core/#omniread.core.ContentType-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.ContentType.HTML","title":"HTML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 
'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"core/#omniread.core.ContentType.JSON","title":"JSON  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"core/#omniread.core.ContentType.PDF","title":"PDF  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"core/#omniread.core.ContentType.XML","title":"XML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"core/content/","title":"Content","text":""},{"location":"core/content/#omniread.core.content","title":"omniread.core.content","text":""},{"location":"core/content/#omniread.core.content--summary","title":"Summary","text":"<p>Canonical content models for OmniRead.</p> <p>This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.</p> <p>The models defined here represent what was extracted, not how it was retrieved or parsed. 
Format-specific behavior and metadata must not alter the semantic meaning of these models.</p>"},{"location":"core/content/#omniread.core.content-classes","title":"Classes","text":""},{"location":"core/content/#omniread.core.content.Content","title":"Content  <code>dataclass</code>","text":"<pre><code>Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n</code></pre> <p>Normalized representation of extracted content.</p> Notes <p>Responsibilities:</p> <pre><code>- A `Content` instance represents a raw content payload along with\n  minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n  parsers, and downstream consumers.\n</code></pre>"},{"location":"core/content/#omniread.core.content.Content-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.Content.content_type","title":"content_type  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>content_type: Optional[ContentType] = None\n</code></pre> <p>Optional MIME type of the content, if known.</p>"},{"location":"core/content/#omniread.core.content.Content.metadata","title":"metadata  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>metadata: Optional[Mapping[str, Any]] = None\n</code></pre> <p>Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).</p>"},{"location":"core/content/#omniread.core.content.Content.raw","title":"raw  <code>instance-attribute</code>","text":"<pre><code>raw: bytes\n</code></pre> <p>Raw content bytes as retrieved from the source.</p>"},{"location":"core/content/#omniread.core.content.Content.source","title":"source  <code>instance-attribute</code>","text":"<pre><code>source: str\n</code></pre> <p>Identifier of the content origin (URL, file path, or logical 
name).</p>"},{"location":"core/content/#omniread.core.content.ContentType","title":"ContentType","text":"<p>               Bases: <code>str</code>, <code>Enum</code></p> <p>Supported MIME types for extracted content.</p> Notes <p>Guarantees:</p> <pre><code>- This enum represents the declared or inferred media type of the\n  content source.\n- It is primarily used for routing content to the appropriate\n  parser or downstream consumer.\n</code></pre>"},{"location":"core/content/#omniread.core.content.ContentType-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.ContentType.HTML","title":"HTML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'text/html'\n</code></pre> <p>HTML document content.</p>"},{"location":"core/content/#omniread.core.content.ContentType.JSON","title":"JSON  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON = 'application/json'\n</code></pre> <p>JSON document content.</p>"},{"location":"core/content/#omniread.core.content.ContentType.PDF","title":"PDF  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'application/pdf'\n</code></pre> <p>PDF document content.</p>"},{"location":"core/content/#omniread.core.content.ContentType.XML","title":"XML  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML = 'application/xml'\n</code></pre> <p>XML document content.</p>"},{"location":"core/parser/","title":"Parser","text":""},{"location":"core/parser/#omniread.core.parser","title":"omniread.core.parser","text":""},{"location":"core/parser/#omniread.core.parser--summary","title":"Summary","text":"<p>Abstract parsing contracts for OmniRead.</p> <p>This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.</p> <p>Parsers are responsible for:</p> <ul> <li>Interpreting a single <code>Content</code> instance</li> 
<li>Validating compatibility with the content type</li> <li>Producing a structured output suitable for downstream consumers</li> </ul> <p>Parsers are not responsible for:</p> <ul> <li>Fetching or acquiring content</li> <li>Performing retries or error recovery</li> <li>Managing multiple content sources</li> </ul>"},{"location":"core/parser/#omniread.core.parser-classes","title":"Classes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"<pre><code>BaseParser(content: Content)\n</code></pre> <p>               Bases: <code>ABC</code>, <code>Generic[T]</code></p> <p>Base interface for all parsers.</p> Notes <p>Guarantees:</p> <pre><code>- A parser is a self-contained object that owns the `Content` it is\n  responsible for interpreting.\n- Consumers may rely on early validation of content compatibility\n  and type-stable return values from `parse()`.\n</code></pre> <p>Responsibilities:</p> <pre><code>- Implementations must declare supported content types via `supported_types`.\n- Implementations must raise parsing-specific exceptions from `parse()`.\n- Implementations must remain deterministic for a given input.\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"core/parser/#omniread.core.parser.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: Set[ContentType] = set()\n</code></pre> <p>Set of content types supported by this parser. 
An empty set indicates that the parser is content-type agnostic.</p>"},{"location":"core/parser/#omniread.core.parser.BaseParser-functions","title":"Functions","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse the owned content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed, structured representation.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully consume the provided content and\n  return a deterministic, structured output.\n</code></pre>"},{"location":"core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"core/scraper/","title":"Scraper","text":""},{"location":"core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":""},{"location":"core/scraper/#omniread.core.scraper--summary","title":"Summary","text":"<p>Abstract scraping contracts for OmniRead.</p> <p>This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.</p> <p>Scrapers are responsible for:</p> <ul> <li>Locating and retrieving raw content bytes</li> <li>Attaching minimal contextual metadata</li> <li>Returning normalized <code>Content</code> objects</li> </ul> <p>Scrapers are explicitly NOT responsible for:</p> <ul> <li>Parsing or interpreting content</li> <li>Inferring structure or semantics</li> <li>Performing content-type specific processing</li> </ul> <p>All interpretation must be delegated to 
parsers.</p>"},{"location":"core/scraper/#omniread.core.scraper-classes","title":"Classes","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":"<p>               Bases: <code>ABC</code></p> <p>Base interface for all scrapers.</p> Notes <p>Responsibilities:</p> <pre><code>- A scraper is responsible ONLY for fetching raw content (bytes)\n  from a source. It must not interpret or parse it.\n- A scraper is a stateless acquisition component that retrieves raw\n  content from a source and returns it as a `Content` object.\n- Scrapers define how content is obtained, not what the content means.\n- Implementations may vary in transport mechanism, authentication\n  strategy, retry and backoff behavior.\n</code></pre> <p>Constraints:</p> <pre><code>- Implementations must not parse content, modify content semantics,\n  or couple scraping logic to a specific parser.\n</code></pre>"},{"location":"core/scraper/#omniread.core.scraper.BaseScraper-functions","title":"Functions","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch  <code>abstractmethod</code>","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch raw content from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>Location identifier (URL, file path, S3 URI, etc.).</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional hints for the scraper (headers, auth, etc.).</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>Content object containing raw bytes and metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must retrieve the content referenced by `source`\n  and return it as 
raw bytes wrapped in a `Content` object.\n</code></pre>"},{"location":"html/","title":"Html","text":""},{"location":"html/#omniread.html","title":"omniread.html","text":""},{"location":"html/#omniread.html--summary","title":"Summary","text":"<p>HTML format implementation for OmniRead.</p> <p>This package provides HTML-specific implementations of the core OmniRead contracts defined in <code>omniread.core</code>.</p> <p>It includes:</p> <ul> <li>HTML parsers that interpret HTML content.</li> <li>HTML scrapers that retrieve HTML documents.</li> </ul> <p>Key characteristics:</p> <ul> <li>Implements, but does not redefine, core contracts.</li> <li>May contain HTML-specific behavior and edge-case handling.</li> <li>Produces canonical content models defined in <code>omniread.core.content</code>.</li> </ul> <p>Consumers should depend on <code>omniread.core</code> interfaces wherever possible and use this package only when HTML-specific behavior is required.</p>"},{"location":"html/#omniread.html--public-api","title":"Public API","text":"<ul> <li><code>HTMLScraper</code></li> <li><code>HTMLParser</code></li> </ul>"},{"location":"html/#omniread.html-classes","title":"Classes","text":""},{"location":"html/#omniread.html.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior,\n  including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Provides reusable helpers for HTML extraction. 
Concrete parsers must\n  explicitly define the return type.\n</code></pre> <p>Guarantees:</p> <pre><code>- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement\n  the `parse()` method.\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"html/#omniread.html.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/#omniread.html.HTMLParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"html/#omniread.html.HTMLParser-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a\n  deterministic, structured output.\n</code></pre>"},{"location":"html/#omniread.html.HTMLParser.parse_div","title":"parse_div  <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') -&gt; str\n</code></pre> <p>Extract normalized text from a 
<code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"html/#omniread.html.HTMLParser.parse_link","title":"parse_link  <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"html/#omniread.html.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document.\n- This includes: Document title, `&lt;meta&gt;` tag name/property to\n  content mappings.\n</code></pre>"},{"location":"html/#omniread.html.HTMLParser.parse_table","title":"parse_table  <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>list[list[str]]: A list of 
rows, where each row is a list of cell text values.</p>"},{"location":"html/#omniread.html.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using <code>httpx</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns\n  them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n  HTML content type, and preserves HTTP response metadata.\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff,\n  handle non-HTML responses.\n</code></pre> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. 
If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"html/#omniread.html.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid HTML.</p>"},{"location":"html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML 
content.</p>"},{"location":"html/parser/","title":"Parser","text":""},{"location":"html/parser/#omniread.html.parser","title":"omniread.html.parser","text":""},{"location":"html/parser/#omniread.html.parser--summary","title":"Summary","text":"<p>HTML parser base implementations for OmniRead.</p> <p>This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in <code>omniread.core.parser</code>.</p> <p>It supplies:</p> <ul> <li>Content-type enforcement for HTML inputs</li> <li>BeautifulSoup initialization and lifecycle management</li> <li>Common helper methods for extracting structured data from HTML elements</li> </ul> <p>Concrete parsers must subclass <code>HTMLParser</code> and implement the <code>parse()</code> method to return a structured representation appropriate for their use case.</p>"},{"location":"html/parser/#omniread.html.parser-classes","title":"Classes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser","title":"HTMLParser","text":"<pre><code>HTMLParser(content: Content, features: str = 'html.parser')\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base HTML parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class extends the core `BaseParser` with HTML-specific behavior,\n  including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Provides reusable helpers for HTML extraction. 
Concrete parsers must\n  explicitly define the return type.\n</code></pre> <p>Guarantees:</p> <pre><code>- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete subclasses must define the output type `T` and implement\n  the `parse()` method.\n</code></pre> <p>Initialize the HTML parser.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>HTML content to be parsed.</p> required <code>features</code> <code>str</code> <p>BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').</p> <code>'html.parser'</code> <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content is empty or not valid HTML.</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {HTML}\n</code></pre> <p>Set of content types supported by this parser (HTML only).</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser-functions","title":"Functions","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Fully parse the HTML content into structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the HTML DOM and return a\n  deterministic, structured output.\n</code></pre>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div  <code>staticmethod</code>","text":"<pre><code>parse_div(div: Tag, *, separator: str = ' ') 
-&gt; str\n</code></pre> <p>Extract normalized text from a <code>&lt;div&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>div</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;div&gt;</code>.</p> required <code>separator</code> <code>str</code> <p>String used to separate text nodes.</p> <code>' '</code> <p>Returns:</p> Name Type Description <code>str</code> <code>str</code> <p>Flattened, whitespace-normalized text content.</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link  <code>staticmethod</code>","text":"<pre><code>parse_link(a: Tag) -&gt; Optional[str]\n</code></pre> <p>Extract the hyperlink reference from an <code>&lt;a&gt;</code> element.</p> <p>Parameters:</p> Name Type Description Default <code>a</code> <code>Tag</code> <p>BeautifulSoup tag representing an anchor.</p> required <p>Returns:</p> Type Description <code>Optional[str]</code> <p>Optional[str]: The value of the <code>href</code> attribute, or None if absent.</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_meta","title":"parse_meta","text":"<pre><code>parse_meta() -&gt; dict[str, Any]\n</code></pre> <p>Extract high-level metadata from the HTML document.</p> <p>Returns:</p> Type Description <code>dict[str, Any]</code> <p>dict[str, Any]: Dictionary containing extracted metadata.</p> Notes <p>Responsibilities:</p> <pre><code>- Extract high-level metadata from the HTML document.\n- This includes: Document title, `&lt;meta&gt;` tag name/property to\n  content mappings.\n</code></pre>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table  <code>staticmethod</code>","text":"<pre><code>parse_table(table: Tag) -&gt; list[list[str]]\n</code></pre> <p>Parse an HTML table into a 2D list of strings.</p> <p>Parameters:</p> Name Type Description Default <code>table</code> <code>Tag</code> <p>BeautifulSoup tag representing a <code>&lt;table&gt;</code>.</p> 
required <p>Returns:</p> Type Description <code>list[list[str]]</code> <p>list[list[str]]: A list of rows, where each row is a list of cell text values.</p>"},{"location":"html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False otherwise.</p>"},{"location":"html/scraper/","title":"Scraper","text":""},{"location":"html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":""},{"location":"html/scraper/#omniread.html.scraper--summary","title":"Summary","text":"<p>HTML scraping implementation for OmniRead.</p> <p>This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core <code>BaseScraper</code> contract using <code>httpx</code> as the transport layer.</p> <p>This scraper is responsible for:</p> <ul> <li>Fetching raw HTML bytes over HTTP(S)</li> <li>Validating response content type</li> <li>Attaching HTTP metadata to the returned content</li> </ul> <p>This scraper is not responsible for:</p> <ul> <li>Parsing or interpreting HTML</li> <li>Retrying failed requests</li> <li>Managing crawl policies or rate limiting</li> </ul>"},{"location":"html/scraper/#omniread.html.scraper-classes","title":"Classes","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"<pre><code>HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Base HTML scraper using <code>httpx</code>.</p> Notes <p>Responsibilities:</p> <pre><code>- This scraper retrieves HTML documents over HTTP(S) and returns\n  them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and 
metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n  HTML content type, and preserves HTTP response metadata.\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not: Parse HTML, perform retries or backoff,\n  handle non-HTML responses.\n</code></pre> <p>Initialize the HTML scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>Client | None</code> <p>Optional pre-configured <code>httpx.Client</code>. If omitted, a client is created internally.</p> <code>None</code> <code>timeout</code> <code>float</code> <p>Request timeout in seconds.</p> <code>15.0</code> <code>headers</code> <code>Optional[Mapping[str, str]]</code> <p>Optional default HTTP headers.</p> <code>None</code> <code>follow_redirects</code> <code>bool</code> <p>Whether to follow HTTP redirects.</p> <code>True</code>"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch an HTML document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>str</code> <p>URL of the HTML document.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to be merged into the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.</p> <p>Raises:</p> Type Description <code>HTTPError</code> <p>If the HTTP request fails.</p> <code>ValueError</code> <p>If the response is not valid 
HTML.</p>"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"<pre><code>validate_content_type(response: httpx.Response) -&gt; None\n</code></pre> <p>Validate that the HTTP response contains HTML content.</p> <p>Parameters:</p> Name Type Description Default <code>response</code> <code>Response</code> <p>HTTP response returned by <code>httpx</code>.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the <code>Content-Type</code> header is missing or does not indicate HTML content.</p>"},{"location":"pdf/","title":"Pdf","text":""},{"location":"pdf/#omniread.pdf","title":"omniread.pdf","text":""},{"location":"pdf/#omniread.pdf--summary","title":"Summary","text":"<p>PDF format implementation for OmniRead.</p> <p>This package provides PDF-specific implementations of the core OmniRead contracts defined in <code>omniread.core</code>.</p> <p>Unlike HTML, PDF handling requires an explicit client layer for document access. 
This package therefore includes:</p> <ul> <li>PDF clients for acquiring raw PDF data.</li> <li>PDF scrapers that coordinate client access.</li> <li>PDF parsers that extract structured content from PDF binaries.</li> </ul> <p>Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.</p>"},{"location":"pdf/#omniread.pdf--public-api","title":"Public API","text":"<ul> <li><code>FileSystemPDFClient</code></li> <li><code>PDFScraper</code></li> <li><code>PDFParser</code></li> </ul>"},{"location":"pdf/#omniread.pdf-classes","title":"Classes","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p>               Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns\n  their raw binary contents.\n</code></pre>"},{"location":"pdf/#omniread.pdf.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and 
provides\n  the extension point for implementing concrete PDF parsing strategies.\n</code></pre> <p>Constraints:</p> <pre><code>- Concrete implementations must define the output type `T` and\n  implement the `parse()` method.\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"pdf/#omniread.pdf.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"pdf/#omniread.pdf.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the PDF binary payload and\n  return a deterministic, structured output.\n</code></pre>"},{"location":"pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False 
otherwise.</p>"},{"location":"pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output\n  into `Content`.\n- Preserves caller-provided metadata.\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"pdf/#omniread.pdf.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"},{"location":"pdf/client/","title":"Client","text":""},{"location":"pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":""},{"location":"pdf/client/#omniread.pdf.client--summary","title":"Summary","text":"<p>PDF client abstractions for OmniRead.</p> <p>This 
module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.</p> <p>Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.</p> <p>Typical backing stores include:</p> <ul> <li>Local filesystems</li> <li>Object storage (S3, GCS, etc.)</li> <li>Network file systems</li> </ul>"},{"location":"pdf/client/#omniread.pdf.client-classes","title":"Classes","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":"<p>               Bases: <code>ABC</code></p> <p>Abstract client responsible for retrieving PDF bytes.</p> <p>Retrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must accept a source identifier appropriate to\n  the backing store.\n- Return the full PDF binary payload.\n- Raise retrieval-specific errors on failure.\n</code></pre>"},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch  <code>abstractmethod</code>","text":"<pre><code>fetch(source: Any) -&gt; bytes\n</code></pre> <p>Fetch raw PDF bytes from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF location, such as a file path, object storage key, or remote reference.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors defined by the implementation.</p>"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":"<p>               Bases: <code>BasePDFClient</code></p> <p>PDF client that reads from the local 
filesystem.</p> Notes <p>Guarantees:</p> <pre><code>- This client reads PDF files directly from the disk and returns\n  their raw binary contents.\n</code></pre>"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"<pre><code>fetch(path: Path) -&gt; bytes\n</code></pre> <p>Read a PDF file from the local filesystem.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>Path</code> <p>Filesystem path to the PDF file.</p> required <p>Returns:</p> Name Type Description <code>bytes</code> <code>bytes</code> <p>Raw PDF bytes.</p> <p>Raises:</p> Type Description <code>FileNotFoundError</code> <p>If the path does not exist.</p> <code>ValueError</code> <p>If the path exists but is not a file.</p>"},{"location":"pdf/parser/","title":"Parser","text":""},{"location":"pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":""},{"location":"pdf/parser/#omniread.pdf.parser--summary","title":"Summary","text":"<p>PDF parser base implementations for OmniRead.</p> <p>This module defines the PDF-specific parser contract, extending the format-agnostic <code>BaseParser</code> with constraints appropriate for PDF content.</p> <p>PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.</p>"},{"location":"pdf/parser/#omniread.pdf.parser-classes","title":"Classes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"<pre><code>PDFParser(content: Content)\n</code></pre> <p>               Bases: <code>BaseParser[T]</code>, <code>Generic[T]</code></p> <p>Base PDF parser.</p> Notes <p>Responsibilities:</p> <pre><code>- This class enforces PDF content-type compatibility and provides\n  the extension point for implementing concrete PDF parsing strategies.\n</code></pre> <p>Constraints:</p> 
<pre><code>- Concrete implementations must define the output type `T` and\n  implement the `parse()` method.\n</code></pre> <p>Initialize the parser with content to be parsed.</p> <p>Parameters:</p> Name Type Description Default <code>content</code> <code>Content</code> <p>Content instance to be parsed.</p> required <p>Raises:</p> Type Description <code>ValueError</code> <p>If the content type is not supported by this parser.</p>"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types  <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>supported_types: set[ContentType] = {PDF}\n</code></pre> <p>Set of content types supported by this parser (PDF only).</p>"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse  <code>abstractmethod</code>","text":"<pre><code>parse() -&gt; T\n</code></pre> <p>Parse PDF content into a structured output.</p> <p>Returns:</p> Name Type Description <code>T</code> <code>T</code> <p>Parsed representation of type <code>T</code>.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Parsing-specific errors as defined by the implementation.</p> Notes <p>Responsibilities:</p> <pre><code>- Implementations must fully interpret the PDF binary payload and\n  return a deterministic, structured output.\n</code></pre>"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"<pre><code>supports() -&gt; bool\n</code></pre> <p>Check whether this parser supports the content's type.</p> <p>Returns:</p> Name Type Description <code>bool</code> <code>bool</code> <p>True if the content type is supported; False 
otherwise.</p>"},{"location":"pdf/scraper/","title":"Scraper","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper--summary","title":"Summary","text":"<p>PDF scraping implementation for OmniRead.</p> <p>This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a <code>Content</code> object.</p> <p>The scraper implements the core <code>BaseScraper</code> contract while delegating all storage and access concerns to a <code>BasePDFClient</code> implementation.</p>"},{"location":"pdf/scraper/#omniread.pdf.scraper-classes","title":"Classes","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper","title":"PDFScraper","text":"<pre><code>PDFScraper(*, client: BasePDFClient)\n</code></pre> <p>               Bases: <code>BaseScraper</code></p> <p>Scraper for PDF sources.</p> Notes <p>Responsibilities:</p> <pre><code>- Delegates byte retrieval to a PDF client and normalizes output\n  into `Content`.\n- Preserves caller-provided metadata.\n</code></pre> <p>Constraints:</p> <pre><code>- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n</code></pre> <p>Initialize the PDF scraper.</p> <p>Parameters:</p> Name Type Description Default <code>client</code> <code>BasePDFClient</code> <p>PDF client responsible for retrieving raw PDF bytes.</p> required"},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"<pre><code>fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -&gt; Content\n</code></pre> <p>Fetch a PDF document from the given source.</p> <p>Parameters:</p> Name Type Description Default <code>source</code> <code>Any</code> <p>Identifier of the PDF source as understood by the configured PDF client.</p> 
required <code>metadata</code> <code>Optional[Mapping[str, Any]]</code> <p>Optional metadata to attach to the returned content.</p> <code>None</code> <p>Returns:</p> Name Type Description <code>Content</code> <code>Content</code> <p>A <code>Content</code> instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.</p> <p>Raises:</p> Type Description <code>Exception</code> <p>Retrieval-specific errors raised by the PDF client.</p>"}]}
						
						
					
				
				
					