updated mcp
2026-03-08 17:57:34 +05:30
parent 9191de9dff
commit 0e49f02c4c
167 changed files with 7632 additions and 98942 deletions


@@ -1082,9 +1082,8 @@
<div class="doc doc-contents first">
<p>Canonical content models for OmniRead.</p>
<hr />
<h4 id="omniread.core.content--summary">Summary</h4>
<h3 id="omniread.core.content--summary">Summary</h3>
<p>Canonical content models for OmniRead.</p>
<p>This module defines the <strong>format-agnostic content representation</strong> used across
all parsers and scrapers in OmniRead.</p>
<p>The models defined here represent <em>what</em> was extracted, not <em>how</em> it was
@@ -1128,8 +1127,12 @@ the semantic meaning of these models.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type
- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with
minimal contextual metadata describing its origin and type.
- This class is the primary exchange format between scrapers,
parsers, and downstream consumers.
</code></pre></div></td></tr></table></div>
</details>
@@ -1270,8 +1273,12 @@ the semantic meaning of these models.</p>
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the content source
- It is primarily used for routing content to the appropriate parser or downstream consumer
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the
content source.
- It is primarily used for routing content to the appropriate
parser or downstream consumer.
</code></pre></div></td></tr></table></div>
</details>
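
Reviewer note: the notes above describe `Content` as a raw payload plus minimal metadata, and `ContentType` as the value used to route content to a parser. The sketch below illustrates that idea only. The field names (`raw`, `source`, `content_type`, `metadata`), the enum members, and the `route` helper are assumptions made for illustration and are not taken from OmniRead itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class ContentType(str, Enum):
    # Hypothetical members; the real enum may define different values.
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class Content:
    # Hypothetical field names: a raw payload plus minimal origin/type metadata.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: dict[str, Any] = field(default_factory=dict)


def route(content: Content) -> str:
    """Pick a (hypothetical) parser name from the declared content type."""
    handlers = {
        ContentType.HTML: "html-parser",
        ContentType.PDF: "pdf-parser",
    }
    if content.content_type not in handlers:
        raise ValueError(f"no parser registered for {content.content_type!r}")
    return handlers[content.content_type]
```

Keeping the model free of format-specific fields is what lets the same object pass from a scraper to any parser unchanged.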


@@ -989,25 +989,26 @@
<div class="doc doc-contents first">
<p>Core domain contracts for OmniRead.</p>
<hr />
<h4 id="omniread.core--summary">Summary</h4>
<h3 id="omniread.core--summary">Summary</h3>
<p>Core domain contracts for OmniRead.</p>
<p>This package defines the <strong>format-agnostic domain layer</strong> of OmniRead.
It exposes canonical content models and abstract interfaces that are
implemented by format-specific modules (HTML, PDF, etc.).</p>
<p>Public exports from this package are considered <strong>stable contracts</strong> and
are safe for downstream consumers to depend on.</p>
<p>Submodules:
- content: Canonical content models and enums
- parser: Abstract parsing contracts
- scraper: Abstract scraping contracts</p>
<p>Submodules:</p>
<ul>
<li><code>content</code>: Canonical content models and enums.</li>
<li><code>parser</code>: Abstract parsing contracts.</li>
<li><code>scraper</code>: Abstract scraping contracts.</li>
</ul>
<p>Format-specific behavior must not be introduced at this layer.</p>
<hr />
<h4 id="omniread.core--public-api">Public API</h4>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>Content
ContentType
</code></pre></div></td></tr></table></div>
<h3 id="omniread.core--public-api">Public API</h3>
<ul>
<li><code>Content</code></li>
<li><code>ContentType</code></li>
</ul>
<hr />
@@ -1045,15 +1046,19 @@ ContentType
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- A parser is a self-contained object that owns the Content it is responsible for interpreting
- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A parser is a self-contained object that owns the `Content` it is
responsible for interpreting.
- Consumers may rely on early validation of content compatibility
and type-stable return values from `parse()`.
</code></pre></div></td></tr></table></div>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must declare supported content types via `supported_types`
- Implementations must raise parsing-specific exceptions from `parse()`
- Implementations must remain deterministic for a given input
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must declare supported content types via `supported_types`.
- Implementations must raise parsing-specific exceptions from `parse()`.
- Implementations must remain deterministic for a given input.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the parser with content to be parsed.</p>
@@ -1073,7 +1078,7 @@ ContentType
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -1216,7 +1221,9 @@ ContentType
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully consume the provided content and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully consume the provided content and
return a deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -1298,13 +1305,21 @@ ContentType
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it
- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object
- Scrapers define how content is obtained, not what the content means
- Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><code>- A scraper is responsible ONLY for fetching raw content (bytes)
from a source. It must not interpret or parse it.
- A scraper is a stateless acquisition component that retrieves raw
content from a source and returns it as a `Content` object.
- Scrapers define how content is obtained, not what the content means.
- Implementations may vary in transport mechanism, authentication
strategy, retry and backoff behavior.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must not parse content, modify content semantics,
or couple scraping logic to a specific parser.
</code></pre></div></td></tr></table></div>
</details>
@@ -1358,7 +1373,7 @@ ContentType
</td>
<td>
<div class="doc-md-description">
<p>Location identifier (URL, file path, S3 URI, etc.)</p>
<p>Location identifier (URL, file path, S3 URI, etc.).</p>
</div>
</td>
<td>
@@ -1372,7 +1387,7 @@ ContentType
</td>
<td>
<div class="doc-md-description">
<p>Optional hints for the scraper (headers, auth, etc.)</p>
<p>Optional hints for the scraper (headers, auth, etc.).</p>
</div>
</td>
<td>
@@ -1394,7 +1409,7 @@ ContentType
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -1432,7 +1447,9 @@ ContentType
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must retrieve the content referenced by `source`
and return it as raw bytes wrapped in a `Content` object.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -1473,8 +1490,12 @@ ContentType
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type
- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with
minimal contextual metadata describing its origin and type.
- This class is the primary exchange format between scrapers,
parsers, and downstream consumers.
</code></pre></div></td></tr></table></div>
</details>
@@ -1615,8 +1636,12 @@ ContentType
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the content source
- It is primarily used for routing content to the appropriate parser or downstream consumer
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the
content source.
- It is primarily used for routing content to the appropriate
parser or downstream consumer.
</code></pre></div></td></tr></table></div>
</details>


@@ -962,19 +962,22 @@
<div class="doc doc-contents first">
<p>Abstract parsing contracts for OmniRead.</p>
<hr />
<h4 id="omniread.core.parser--summary">Summary</h4>
<h3 id="omniread.core.parser--summary">Summary</h3>
<p>Abstract parsing contracts for OmniRead.</p>
<p>This module defines the <strong>format-agnostic parser interface</strong> used to transform
raw content into structured, typed representations.</p>
<p>Parsers are responsible for:
- Interpreting a single <code>Content</code> instance
- Validating compatibility with the content type
- Producing a structured output suitable for downstream consumers</p>
<p>Parsers are not responsible for:
- Fetching or acquiring content
- Performing retries or error recovery
- Managing multiple content sources</p>
<p>Parsers are responsible for:</p>
<ul>
<li>Interpreting a single <code>Content</code> instance</li>
<li>Validating compatibility with the content type</li>
<li>Producing a structured output suitable for downstream consumers</li>
</ul>
<p>Parsers are not responsible for:</p>
<ul>
<li>Fetching or acquiring content</li>
<li>Performing retries or error recovery</li>
<li>Managing multiple content sources</li>
</ul>
@@ -1011,15 +1014,19 @@ raw content into structured, typed representations.</p>
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- A parser is a self-contained object that owns the Content it is responsible for interpreting
- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A parser is a self-contained object that owns the `Content` it is
responsible for interpreting.
- Consumers may rely on early validation of content compatibility
and type-stable return values from `parse()`.
</code></pre></div></td></tr></table></div>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must declare supported content types via `supported_types`
- Implementations must raise parsing-specific exceptions from `parse()`
- Implementations must remain deterministic for a given input
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must declare supported content types via `supported_types`.
- Implementations must raise parsing-specific exceptions from `parse()`.
- Implementations must remain deterministic for a given input.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the parser with content to be parsed.</p>
@@ -1039,7 +1046,7 @@ raw content into structured, typed representations.</p>
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -1182,7 +1189,9 @@ raw content into structured, typed representations.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully consume the provided content and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully consume the provided content and
return a deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
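
Reviewer note: the parser contract above (early validation against `supported_types`, a typed and deterministic `parse()`) can be sketched roughly as follows. This is a minimal stand-in, not OmniRead's actual `BaseParser`: the `Content` fields, the use of `ValueError` for incompatible content, and the `WordCountParser` subclass are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    # Hypothetical members for illustration.
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class Content:
    # Hypothetical field names.
    raw: bytes
    source: str
    content_type: ContentType


class BaseParser(ABC, Generic[T]):
    # Subclasses declare which content types they accept.
    supported_types: frozenset = frozenset()

    def __init__(self, content: Content) -> None:
        # Early validation: reject incompatible content at construction time.
        # ValueError is an assumption; the real library may raise its own type.
        if content.content_type not in self.supported_types:
            raise ValueError(f"unsupported content type: {content.content_type}")
        self.content = content

    @abstractmethod
    def parse(self) -> T:
        """Return a deterministic, structured result for the owned content."""


class WordCountParser(BaseParser[int]):
    # Hypothetical concrete parser with a fixed return type of int.
    supported_types = frozenset({ContentType.HTML})

    def parse(self) -> int:
        return len(self.content.raw.split())
```

Validating in the constructor means a consumer learns about a mismatch before calling `parse()`, which is the "early validation" guarantee the notes describe.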


@@ -896,19 +896,22 @@
<div class="doc doc-contents first">
<p>Abstract scraping contracts for OmniRead.</p>
<hr />
<h4 id="omniread.core.scraper--summary">Summary</h4>
<h3 id="omniread.core.scraper--summary">Summary</h3>
<p>Abstract scraping contracts for OmniRead.</p>
<p>This module defines the <strong>format-agnostic scraper interface</strong> responsible for
acquiring raw content from external sources.</p>
<p>Scrapers are responsible for:
- Locating and retrieving raw content bytes
- Attaching minimal contextual metadata
- Returning normalized <code>Content</code> objects</p>
<p>Scrapers are explicitly NOT responsible for:
- Parsing or interpreting content
- Inferring structure or semantics
- Performing content-type specific processing</p>
<p>Scrapers are responsible for:</p>
<ul>
<li>Locating and retrieving raw content bytes</li>
<li>Attaching minimal contextual metadata</li>
<li>Returning normalized <code>Content</code> objects</li>
</ul>
<p>Scrapers are explicitly NOT responsible for:</p>
<ul>
<li>Parsing or interpreting content</li>
<li>Inferring structure or semantics</li>
<li>Performing content-type specific processing</li>
</ul>
<p>All interpretation must be delegated to parsers.</p>
@@ -947,13 +950,21 @@ acquiring raw content from external sources.</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it
- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object
- Scrapers define how content is obtained, not what the content means
- Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><code>- A scraper is responsible ONLY for fetching raw content (bytes)
from a source. It must not interpret or parse it.
- A scraper is a stateless acquisition component that retrieves raw
content from a source and returns it as a `Content` object.
- Scrapers define how content is obtained, not what the content means.
- Implementations may vary in transport mechanism, authentication
strategy, retry and backoff behavior.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must not parse content, modify content semantics,
or couple scraping logic to a specific parser.
</code></pre></div></td></tr></table></div>
</details>
@@ -1007,7 +1018,7 @@ acquiring raw content from external sources.</p>
</td>
<td>
<div class="doc-md-description">
<p>Location identifier (URL, file path, S3 URI, etc.)</p>
<p>Location identifier (URL, file path, S3 URI, etc.).</p>
</div>
</td>
<td>
@@ -1021,7 +1032,7 @@ acquiring raw content from external sources.</p>
</td>
<td>
<div class="doc-md-description">
<p>Optional hints for the scraper (headers, auth, etc.)</p>
<p>Optional hints for the scraper (headers, auth, etc.).</p>
</div>
</td>
<td>
@@ -1043,7 +1054,7 @@ acquiring raw content from external sources.</p>
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -1081,7 +1092,9 @@ acquiring raw content from external sources.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must retrieve the content referenced by `source`
and return it as raw bytes wrapped in a `Content` object.
</code></pre></div></td></tr></table></div>
</details>
</div>
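
Reviewer note: the scraper contract above (take a `source` and optional hints, return raw bytes wrapped in a `Content` object, never parse) might look like the sketch below. The method name `fetch`, the `metadata` keyword, the `Content` fields, and the `InMemoryScraper` class are hypothetical and chosen only so the example runs without network access.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Content:
    # Hypothetical field names.
    raw: bytes
    source: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class BaseScraper(ABC):
    @abstractmethod
    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        """Retrieve raw bytes for `source` without interpreting them."""


class InMemoryScraper(BaseScraper):
    """Hypothetical scraper backed by a fixed mapping of source -> bytes."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        # Acquisition only: the bytes are returned untouched.
        raw = self._documents[source]  # raises KeyError for unknown sources
        return Content(raw=raw, source=source, metadata=dict(metadata or {}))
```

Because the scraper returns bytes untouched, any transport (HTTP, filesystem, object storage) can sit behind the same interface without the parser noticing.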


@@ -902,26 +902,29 @@
<div class="doc doc-contents first">
<p>HTML format implementation for OmniRead.</p>
<hr />
<h4 id="omniread.html--summary">Summary</h4>
<h3 id="omniread.html--summary">Summary</h3>
<p>HTML format implementation for OmniRead.</p>
<p>This package provides <strong>HTML-specific implementations</strong> of the core OmniRead
contracts defined in <code>omniread.core</code>.</p>
<p>It includes:
- HTML parsers that interpret HTML content
- HTML scrapers that retrieve HTML documents</p>
<p>This package:
- Implements, but does not redefine, core contracts
- May contain HTML-specific behavior and edge-case handling
- Produces canonical content models defined in <code>omniread.core.content</code></p>
<p>It includes:</p>
<ul>
<li>HTML parsers that interpret HTML content.</li>
<li>HTML scrapers that retrieve HTML documents.</li>
</ul>
<p>Key characteristics:</p>
<ul>
<li>Implements, but does not redefine, core contracts.</li>
<li>May contain HTML-specific behavior and edge-case handling.</li>
<li>Produces canonical content models defined in <code>omniread.core.content</code>.</li>
</ul>
<p>Consumers should depend on <code>omniread.core</code> interfaces wherever possible and
use this package only when HTML-specific behavior is required.</p>
<hr />
<h4 id="omniread.html--public-api">Public API</h4>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>HTMLScraper
HTMLParser
</code></pre></div></td></tr></table></div>
<h3 id="omniread.html--public-api">Public API</h3>
<ul>
<li><code>HTMLScraper</code></li>
<li><code>HTMLParser</code></li>
</ul>
<hr />
@@ -949,7 +952,7 @@ HTMLParser
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
<p>Base HTML parser.</p>
@@ -959,14 +962,24 @@ HTMLParser
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers
- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
- Provides reusable helpers for HTML extraction. Concrete parsers must
explicitly define the return type.
</code></pre></div></td></tr></table></div>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Accepts only HTML content.
- Owns a parsed BeautifulSoup DOM tree.
- Provides pure helper utilities for common HTML structures.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement
the `parse()` method.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the HTML parser.</p>
@@ -986,7 +999,7 @@ HTMLParser
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -1120,7 +1133,9 @@ HTMLParser
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a
deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -1336,8 +1351,10 @@ Dictionary containing extracted metadata.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document
- This includes: Document title, `&lt;meta&gt;` tag name/property → content mappings
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document.
- This includes: Document title, `&lt;meta&gt;` tag name/property to
content mappings.
</code></pre></div></td></tr></table></div>
</details>
</div>
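
Reviewer note: the metadata helper documented above collects the document title and `<meta>` name/property to content mappings. The docs say OmniRead does this with BeautifulSoup; the sketch below deliberately swaps in the standard library's `html.parser` so it runs without third-party packages. The `extract_metadata` function and `MetadataExtractor` class are hypothetical names, and the output shape (a flat dict with a `"title"` key) is an assumption.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class MetadataExtractor(StdlibHTMLParser):
    """Collects the <title> text and <meta> name/property -> content pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict = {}
        self._in_title = False
        self._title_parts: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Prefer `name`, fall back to `property` (e.g. Open Graph tags).
            key = attr.get("name") or attr.get("property")
            if key and attr.get("content") is not None:
                self.metadata[key] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            self.metadata["title"] = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html: str) -> dict:
    extractor = MetadataExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.metadata
```

The function is pure: the same HTML string always yields the same dictionary, which matches the determinism the parser contract asks for.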
@@ -1484,21 +1501,29 @@ A list of rows, where each row is a list of cell text values.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="../omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="../core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
<p>Base HTML scraper using httpx.</p>
<p>Base HTML scraper using <code>httpx</code>.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object
- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns
them as raw content wrapped in a `Content` object.
- Fetches raw bytes and metadata only.
- The scraper uses `httpx.Client` for HTTP requests, enforces an
HTML content type, and preserves HTTP response metadata.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not: Parse HTML, perform retries or backoff,
handle non-HTML responses.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the HTML scraper.</p>
@@ -1657,7 +1682,7 @@ A list of rows, where each row is a list of cell text values.</p>
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">


@@ -1034,15 +1034,16 @@
<div class="doc doc-contents first">
<p>HTML parser base implementations for OmniRead.</p>
<hr />
<h4 id="omniread.html.parser--summary">Summary</h4>
<h3 id="omniread.html.parser--summary">Summary</h3>
<p>HTML parser base implementations for OmniRead.</p>
<p>This module provides reusable HTML parsing utilities built on top of
the abstract parser contracts defined in <code>omniread.core.parser</code>.</p>
<p>It supplies:
- Content-type enforcement for HTML inputs
- BeautifulSoup initialization and lifecycle management
- Common helper methods for extracting structured data from HTML elements</p>
<p>It supplies:</p>
<ul>
<li>Content-type enforcement for HTML inputs</li>
<li>BeautifulSoup initialization and lifecycle management</li>
<li>Common helper methods for extracting structured data from HTML elements</li>
</ul>
<p>Concrete parsers must subclass <code>HTMLParser</code> and implement the <code>parse()</code> method
to return a structured representation appropriate for their use case.</p>
@@ -1071,7 +1072,7 @@ to return a structured representation appropriate for their use case.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../../omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../../core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
<p>Base HTML parser.</p>
@@ -1081,14 +1082,24 @@ to return a structured representation appropriate for their use case.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers
- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
- Provides reusable helpers for HTML extraction. Concrete parsers must
explicitly define the return type.
</code></pre></div></td></tr></table></div>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Accepts only HTML content.
- Owns a parsed BeautifulSoup DOM tree.
- Provides pure helper utilities for common HTML structures.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement
the `parse()` method.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the HTML parser.</p>
@@ -1108,7 +1119,7 @@ to return a structured representation appropriate for their use case.</p>
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -1242,7 +1253,9 @@ to return a structured representation appropriate for their use case.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a
deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -1458,8 +1471,10 @@ Dictionary containing extracted metadata.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document
- This includes: Document title, `&lt;meta&gt;` tag name/property → content mappings
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document.
- This includes: Document title, `&lt;meta&gt;` tag name/property to
content mappings.
</code></pre></div></td></tr></table></div>
</details>
</div>
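<p>The metadata extraction described in these notes (document title plus <code>&lt;meta&gt;</code> name/property to content mappings) can be sketched as follows. This is a minimal illustration built on the standard library's <code>html.parser</code>, not OmniRead's BeautifulSoup-based implementation; the names <code>MetadataExtractor</code> and <code>extract_metadata</code> are hypothetical.</p>

```python
from html.parser import HTMLParser as StdlibHTMLParser


class MetadataExtractor(StdlibHTMLParser):
    """Hypothetical helper: collects the title text and meta name/property values."""

    def __init__(self) -> None:
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            # A meta tag is keyed by `name` when present, otherwise by `property`.
            key = attr_map.get("name") or attr_map.get("property")
            content = attr_map.get("content")
            if key and content is not None:
                self.meta[key] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            self.title = "".join(self._title_parts).strip() or None

    def handle_data(self, data):
        # Only text that appears inside the title element is collected.
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html: str) -> dict:
    """Return a dictionary holding the document title and the meta mappings."""
    extractor = MetadataExtractor()
    extractor.feed(html)
    extractor.close()
    return {"title": extractor.title, "meta": extractor.meta}
```

<p>The returned dictionary shape (<code>title</code> and <code>meta</code> keys) is an assumption made for this sketch; the actual return structure is defined by the library.</p>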


@@ -914,20 +914,23 @@
<div class="doc doc-contents first">
<p>HTML scraping implementation for OmniRead.</p>
<hr />
<h4 id="omniread.html.scraper--summary">Summary</h4>
<h3 id="omniread.html.scraper--summary">Summary</h3>
<p>HTML scraping implementation for OmniRead.</p>
<p>This module provides an HTTP-based scraper for retrieving HTML documents.
It implements the core <code>BaseScraper</code> contract using <code>httpx</code> as the transport
layer.</p>
<p>This scraper is responsible for:
- Fetching raw HTML bytes over HTTP(S)
- Validating response content type
- Attaching HTTP metadata to the returned content</p>
<p>This scraper is not responsible for:
- Parsing or interpreting HTML
- Retrying failed requests
- Managing crawl policies or rate limiting</p>
<p>This scraper is responsible for:</p>
<ul>
<li>Fetching raw HTML bytes over HTTP(S)</li>
<li>Validating response content type</li>
<li>Attaching HTTP metadata to the returned content</li>
</ul>
<p>This scraper is not responsible for:</p>
<ul>
<li>Parsing or interpreting HTML</li>
<li>Retrying failed requests</li>
<li>Managing crawl policies or rate limiting</li>
</ul>
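<p>The content-type validation listed among the scraper's responsibilities might look roughly like the check below. This is a sketch under assumptions: the helper name <code>is_html_content_type</code> and the set of accepted media types are illustrative, and the real scraper may apply different rules.</p>

```python
# Assumed set of media types treated as HTML for this sketch.
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_content_type(header_value):
    """Return True when a Content-Type header value declares an HTML media type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES
```

<p>A scraper could call such a check on the response's <code>Content-Type</code> header and raise an error before wrapping the bytes in a <code>Content</code> object, which keeps non-HTML payloads away from HTML parsers.</p>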
@@ -954,21 +957,29 @@ layer.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="../../omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="../../core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
<p>Base HTML scraper using httpx.</p>
<p>Base HTML scraper using <code>httpx</code>.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object
- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns
them as raw content wrapped in a `Content` object.
- Fetches raw bytes and metadata only.
- The scraper uses `httpx.Client` for HTTP requests, enforces an
HTML content type, and preserves HTTP response metadata.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not: Parse HTML, perform retries or backoff,
handle non-HTML responses.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the HTML scraper.</p>
@@ -1127,7 +1138,7 @@ layer.</p>
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">


@@ -298,7 +298,8 @@
</span>
</a>
</li>
<nav class="md-nav" aria-label="Installation">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#omniread--quick-start" class="md-nav__link">
@@ -307,6 +308,11 @@
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
@@ -316,6 +322,15 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#omniread--core-philosophy" class="md-nav__link">
<span class="md-ellipsis">
Core Philosophy
</span>
</a>
</li>
<li class="md-nav__item">
@@ -1237,7 +1252,8 @@
</span>
</a>
</li>
<nav class="md-nav" aria-label="Installation">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#omniread--quick-start" class="md-nav__link">
@@ -1246,6 +1262,11 @@
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
@@ -1255,6 +1276,15 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#omniread--core-philosophy" class="md-nav__link">
<span class="md-ellipsis">
Core Philosophy
</span>
</a>
</li>
<li class="md-nav__item">
@@ -1746,105 +1776,118 @@
<div class="doc doc-contents first">
<p>OmniRead — format-agnostic content acquisition and parsing framework.</p>
<hr />
<h4 id="omniread--summary">Summary</h4>
<p>OmniRead provides a <strong>cleanly layered architecture</strong> for fetching, parsing,
<h3 id="omniread--summary">Summary</h3>
<p><code>OmniRead</code> — format-agnostic content acquisition and parsing framework.</p>
<p><code>OmniRead</code> provides a <strong>cleanly layered architecture</strong> for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.</p>
<p>The library is structured around three core concepts:</p>
<ol>
<li><strong>Content</strong>: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.</li>
<li><strong>Scrapers</strong>: Components responsible for <em>acquiring</em> raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.</li>
<li><strong>Parsers</strong>: Components responsible for <em>interpreting</em> acquired content and converting it into structured, typed representations.</li>
<li><strong><code>Content</code></strong>: A canonical, format-agnostic container representing raw content
bytes and minimal contextual metadata.</li>
<li><strong><code>Scrapers</code></strong>: Components responsible for <em>acquiring</em> raw content from a
source (HTTP, filesystem, object storage, etc.). <code>Scrapers</code> never interpret
content.</li>
<li><strong><code>Parsers</code></strong>: Components responsible for <em>interpreting</em> acquired content and
converting it into structured, typed representations.</li>
</ol>
<p>OmniRead deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior</p>
<hr />
<h4 id="omniread--installation">Installation</h4>
<p>Install OmniRead using pip:</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>pip install omniread
</code></pre></div></td></tr></table></div>
<p>Or with Poetry:</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>poetry add omniread
</code></pre></div></td></tr></table></div>
<p><code>OmniRead</code> deliberately separates these responsibilities to ensure:</p>
<ul>
<li>Clear boundaries between IO and interpretation.</li>
<li>Replaceable implementations per format.</li>
<li>Predictable, testable behavior.</li>
</ul>
<h3 id="omniread--installation">Installation</h3>
<p>Install <code>OmniRead</code> using pip:</p>
<div class="language-bash highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-0-1">1</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-0-1"><a id="__codelineno-0-1" name="__codelineno-0-1"></a>pip<span class="w"> </span>install<span class="w"> </span>omniread
</span></code></pre></div></td></tr></table></div>
<p>Install OmniRead using Poetry:
<div class="language-bash highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-1-1">1</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-1-1"><a id="__codelineno-1-1" name="__codelineno-1-1"></a>poetry<span class="w"> </span>add<span class="w"> </span>omniread
</span></code></pre></div></td></tr></table></div></p>
<hr />
<h4 id="omniread--quick-start">Quick start</h4>
<p>HTML example:</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span></pre></div></td><td class="code"><div><pre><span></span><code>from omniread import HTMLScraper, HTMLParser
scraper = HTMLScraper()
content = scraper.fetch(&quot;https://example.com&quot;)
class TitleParser(HTMLParser[str]):
def parse(self) -&gt; str:
return self._soup.title.string
parser = TitleParser(content)
title = parser.parse()
</code></pre></div></td></tr></table></div>
<p>PDF example:</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span></pre></div></td><td class="code"><div><pre><span></span><code>from omniread import FileSystemPDFClient, PDFScraper, PDFParser
from pathlib import Path
client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path(&quot;document.pdf&quot;))
class TextPDFParser(PDFParser[str]):
def parse(self) -&gt; str:
# implement PDF text extraction
...
parser = TextPDFParser(content)
result = parser.parse()
</code></pre></div></td></tr></table></div>
<hr />
<h4 id="omniread--public-api">Public API</h4>
<details class="example" open>
<summary>Example</summary>
<p>HTML example:
<div class="language-python highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-0-1"> 1</a></span>
<span class="normal"><a href="#__codelineno-0-2"> 2</a></span>
<span class="normal"><a href="#__codelineno-0-3"> 3</a></span>
<span class="normal"><a href="#__codelineno-0-4"> 4</a></span>
<span class="normal"><a href="#__codelineno-0-5"> 5</a></span>
<span class="normal"><a href="#__codelineno-0-6"> 6</a></span>
<span class="normal"><a href="#__codelineno-0-7"> 7</a></span>
<span class="normal"><a href="#__codelineno-0-8"> 8</a></span>
<span class="normal"><a href="#__codelineno-0-9"> 9</a></span>
<span class="normal"><a href="#__codelineno-0-10">10</a></span>
<span class="normal"><a href="#__codelineno-0-11">11</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-0-1"><a id="__codelineno-0-1" name="__codelineno-0-1"></a><span class="kn">from</span><span class="w"> </span><span class="nn">omniread</span><span class="w"> </span><span class="kn">import</span> <span class="n">HTMLScraper</span><span class="p">,</span> <span class="n">HTMLParser</span>
</span><span id="__span-0-2"><a id="__codelineno-0-2" name="__codelineno-0-2"></a>
</span><span id="__span-0-3"><a id="__codelineno-0-3" name="__codelineno-0-3"></a><span class="n">scraper</span> <span class="o">=</span> <span class="n">HTMLScraper</span><span class="p">()</span>
</span><span id="__span-0-4"><a id="__codelineno-0-4" name="__codelineno-0-4"></a><span class="n">content</span> <span class="o">=</span> <span class="n">scraper</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="s2">&quot;https://example.com&quot;</span><span class="p">)</span>
</span><span id="__span-0-5"><a id="__codelineno-0-5" name="__codelineno-0-5"></a>
</span><span id="__span-0-6"><a id="__codelineno-0-6" name="__codelineno-0-6"></a><span class="k">class</span><span class="w"> </span><span class="nc">TitleParser</span><span class="p">(</span><span class="n">HTMLParser</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
</span><span id="__span-0-7"><a id="__codelineno-0-7" name="__codelineno-0-7"></a> <span class="k">def</span><span class="w"> </span><span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span><span id="__span-0-8"><a id="__codelineno-0-8" name="__codelineno-0-8"></a> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">string</span>
</span><span id="__span-0-9"><a id="__codelineno-0-9" name="__codelineno-0-9"></a>
</span><span id="__span-0-10"><a id="__codelineno-0-10" name="__codelineno-0-10"></a><span class="n">parser</span> <span class="o">=</span> <span class="n">TitleParser</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</span><span id="__span-0-11"><a id="__codelineno-0-11" name="__codelineno-0-11"></a><span class="n">title</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">()</span>
</span></code></pre></div></td></tr></table></div></p>
<p>PDF example:
<div class="language-python highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-1-1"> 1</a></span>
<span class="normal"><a href="#__codelineno-1-2"> 2</a></span>
<span class="normal"><a href="#__codelineno-1-3"> 3</a></span>
<span class="normal"><a href="#__codelineno-1-4"> 4</a></span>
<span class="normal"><a href="#__codelineno-1-5"> 5</a></span>
<span class="normal"><a href="#__codelineno-1-6"> 6</a></span>
<span class="normal"><a href="#__codelineno-1-7"> 7</a></span>
<span class="normal"><a href="#__codelineno-1-8"> 8</a></span>
<span class="normal"><a href="#__codelineno-1-9"> 9</a></span>
<span class="normal"><a href="#__codelineno-1-10">10</a></span>
<span class="normal"><a href="#__codelineno-1-11">11</a></span>
<span class="normal"><a href="#__codelineno-1-12">12</a></span>
<span class="normal"><a href="#__codelineno-1-13">13</a></span>
<span class="normal"><a href="#__codelineno-1-14">14</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-1-1"><a id="__codelineno-1-1" name="__codelineno-1-1"></a><span class="kn">from</span><span class="w"> </span><span class="nn">omniread</span><span class="w"> </span><span class="kn">import</span> <span class="n">FileSystemPDFClient</span><span class="p">,</span> <span class="n">PDFScraper</span><span class="p">,</span> <span class="n">PDFParser</span>
</span><span id="__span-1-2"><a id="__codelineno-1-2" name="__codelineno-1-2"></a><span class="kn">from</span><span class="w"> </span><span class="nn">pathlib</span><span class="w"> </span><span class="kn">import</span> <span class="n">Path</span>
</span><span id="__span-1-3"><a id="__codelineno-1-3" name="__codelineno-1-3"></a>
</span><span id="__span-1-4"><a id="__codelineno-1-4" name="__codelineno-1-4"></a><span class="n">client</span> <span class="o">=</span> <span class="n">FileSystemPDFClient</span><span class="p">()</span>
</span><span id="__span-1-5"><a id="__codelineno-1-5" name="__codelineno-1-5"></a><span class="n">scraper</span> <span class="o">=</span> <span class="n">PDFScraper</span><span class="p">(</span><span class="n">client</span><span class="o">=</span><span class="n">client</span><span class="p">)</span>
</span><span id="__span-1-6"><a id="__codelineno-1-6" name="__codelineno-1-6"></a><span class="n">content</span> <span class="o">=</span> <span class="n">scraper</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">&quot;document.pdf&quot;</span><span class="p">))</span>
</span><span id="__span-1-7"><a id="__codelineno-1-7" name="__codelineno-1-7"></a>
</span><span id="__span-1-8"><a id="__codelineno-1-8" name="__codelineno-1-8"></a><span class="k">class</span><span class="w"> </span><span class="nc">TextPDFParser</span><span class="p">(</span><span class="n">PDFParser</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
</span><span id="__span-1-9"><a id="__codelineno-1-9" name="__codelineno-1-9"></a> <span class="k">def</span><span class="w"> </span><span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span><span id="__span-1-10"><a id="__codelineno-1-10" name="__codelineno-1-10"></a> <span class="c1"># implement PDF text extraction</span>
</span><span id="__span-1-11"><a id="__codelineno-1-11" name="__codelineno-1-11"></a> <span class="o">...</span>
</span><span id="__span-1-12"><a id="__codelineno-1-12" name="__codelineno-1-12"></a>
</span><span id="__span-1-13"><a id="__codelineno-1-13" name="__codelineno-1-13"></a><span class="n">parser</span> <span class="o">=</span> <span class="n">TextPDFParser</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</span><span id="__span-1-14"><a id="__codelineno-1-14" name="__codelineno-1-14"></a><span class="n">result</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">()</span>
</span></code></pre></div></td></tr></table></div></p>
</details> <hr />
<h3 id="omniread--public-api">Public API</h3>
<p>This module re-exports the <strong>recommended public entry points</strong> of OmniRead.
Consumers are encouraged to import from this namespace rather than from
format-specific submodules directly, unless advanced customization is
required.</p>
<p><strong>Core:</strong>
- Content
- ContentType</p>
<p><strong>HTML:</strong>
- HTMLScraper
- HTMLParser</p>
<p><strong>PDF:</strong>
- FileSystemPDFClient
- PDFScraper
- PDFParser</p>
<p><strong>Core Philosophy:</strong>
<code>OmniRead</code> is designed as a <strong>decoupled content engine</strong>:
1. <strong>Separation of Concerns</strong>: Scrapers <em>fetch</em>, Parsers <em>interpret</em>. Neither knows about the other.
2. <strong>Normalized Exchange</strong>: All components communicate via the <code>Content</code> model, ensuring a consistent contract.
3. <strong>Format Agnosticism</strong>: The core logic is independent of whether the input is HTML, PDF, or JSON.</p>
<ul>
<li><code>Content</code>: Canonical content model.</li>
<li><code>ContentType</code>: Supported media types.</li>
<li><code>HTMLScraper</code>: HTTP-based HTML acquisition.</li>
<li><code>HTMLParser</code>: Base parser for HTML DOM interpretation.</li>
<li><code>FileSystemPDFClient</code>: Local filesystem PDF access.</li>
<li><code>PDFScraper</code>: PDF-specific content acquisition.</li>
<li><code>PDFParser</code>: Base parser for PDF binary interpretation.</li>
</ul>
<hr />
<h3 id="omniread--core-philosophy">Core Philosophy</h3>
<p><code>OmniRead</code> is designed as a <strong>decoupled content engine</strong>:</p>
<ol>
<li><strong>Separation of Concerns</strong>: Scrapers <em>fetch</em>, Parsers <em>interpret</em>. Neither
knows about the other.</li>
<li><strong>Normalized Exchange</strong>: All components communicate via the <code>Content</code> model,
ensuring a consistent contract.</li>
<li><strong>Format Agnosticism</strong>: The core logic is independent of whether the input
is HTML, PDF, or JSON.</li>
</ol>
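<p>The three principles above can be illustrated with a small self-contained sketch. All class names here (<code>SketchContent</code>, <code>SketchScraper</code>, <code>SketchParser</code>, <code>InMemoryScraper</code>, <code>WordCountParser</code>) are hypothetical stand-ins, and the field names are assumptions; they do not reproduce OmniRead's actual classes.</p>

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SketchContent:
    """Stand-in for the normalized exchange model: raw bytes plus minimal metadata."""

    raw: bytes
    source: str
    content_type: str
    metadata: dict = field(default_factory=dict)


class SketchScraper(ABC):
    """Acquires content; never interprets it."""

    @abstractmethod
    def fetch(self, source: str) -> SketchContent: ...


class SketchParser(ABC, Generic[T]):
    """Interprets content; never acquires it."""

    def __init__(self, content: SketchContent) -> None:
        self.content = content

    @abstractmethod
    def parse(self) -> T: ...


class InMemoryScraper(SketchScraper):
    """Toy scraper that serves documents from a dictionary instead of the network."""

    def __init__(self, documents: dict) -> None:
        self._documents = documents

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(
            raw=self._documents[source],
            source=source,
            content_type="text/plain",
        )


class WordCountParser(SketchParser[int]):
    """Toy parser whose declared output type is int."""

    def parse(self) -> int:
        return len(self.content.raw.decode("utf-8").split())


scraper = InMemoryScraper({"memo": b"scrapers fetch, parsers interpret"})
content = scraper.fetch("memo")
word_count = WordCountParser(content).parse()
```

<p>Because the scraper and parser only share the content object, either side can be replaced (for example, swapping the in-memory scraper for an HTTP one) without touching the other.</p>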
<hr />
@@ -1884,8 +1927,12 @@ required.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type
- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with
minimal contextual metadata describing its origin and type.
- This class is the primary exchange format between scrapers,
parsers, and downstream consumers.
</code></pre></div></td></tr></table></div>
</details>
@@ -2026,8 +2073,12 @@ required.</p>
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the content source
- It is primarily used for routing content to the appropriate parser or downstream consumer
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the
content source.
- It is primarily used for routing content to the appropriate
parser or downstream consumer.
</code></pre></div></td></tr></table></div>
</details>
@@ -2169,7 +2220,9 @@ required.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns their raw binary contents
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns
their raw binary contents.
</code></pre></div></td></tr></table></div>
</details>
@@ -2311,7 +2364,7 @@ required.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
<p>Base HTML parser.</p>
@@ -2321,14 +2374,24 @@ required.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers
- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
- Provides reusable helpers for HTML extraction. Concrete parsers must
explicitly define the return type.
</code></pre></div></td></tr></table></div>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Accepts only HTML content.
- Owns a parsed BeautifulSoup DOM tree.
- Provides pure helper utilities for common HTML structures.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement
the `parse()` method.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the HTML parser.</p>
@@ -2348,7 +2411,7 @@ required.</p>
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -2482,7 +2545,9 @@ required.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a
deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -2698,8 +2763,10 @@ Dictionary containing extracted metadata.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document
- This includes: Document title, `&lt;meta&gt;` tag name/property → content mappings
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document.
- This includes the document title and `&lt;meta&gt;` tag name/property to
content mappings.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -2846,21 +2913,29 @@ A list of rows, where each row is a list of cell text values.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
<p>Base HTML scraper using httpx.</p>
<p>Base HTML scraper using <code>httpx</code>.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object
- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns
them as raw content wrapped in a `Content` object.
- Fetches raw bytes and metadata only.
- The scraper uses `httpx.Client` for HTTP requests, enforces an
HTML content type, and preserves HTTP response metadata.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not parse HTML, perform retries or backoff,
or handle non-HTML responses.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the HTML scraper.</p>
@@ -3019,7 +3094,7 @@ A list of rows, where each row is a list of cell text values.</p>
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -3160,7 +3235,7 @@ A list of rows, where each row is a list of cell text values.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
<p>Base PDF parser.</p>
@@ -3169,10 +3244,14 @@ A list of rows, where each row is a list of cell text values.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides
the extension point for implementing concrete PDF parsing strategies.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must define the output type `T` and
implement the `parse()` method.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the parser with content to be parsed.</p>
@@ -3192,7 +3271,7 @@ A list of rows, where each row is a list of cell text values.</p>
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -3335,7 +3414,9 @@ A list of rows, where each row is a list of cell text values.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and
return a deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -3406,7 +3487,7 @@ A list of rows, where each row is a list of cell text values.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
<p>Scraper for PDF sources.</p>
@@ -3416,11 +3497,15 @@ A list of rows, where each row is a list of cell text values.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output into Content
- Preserves caller-provided metadata
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output
into `Content`.
- Preserves caller-provided metadata.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not perform parsing or interpretation.
- The scraper does not assume a specific storage backend.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the PDF scraper.</p>
@@ -3537,7 +3622,7 @@ A list of rows, where each row is a list of cell text values.</p>
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -3590,9 +3675,7 @@ A list of rows, where each row is a list of cell text values.</p>
</div>
</div><ul>
<li><a href="omniread/">Omniread</a></li>
</ul>
</div>

Binary file not shown.

File diff suppressed because it is too large (notice repeated for 12 files).

View File

@@ -974,18 +974,19 @@
<div class="doc doc-contents first">
<p>PDF client abstractions for OmniRead.</p>
<hr />
<h4 id="omniread.pdf.client--summary">Summary</h4>
<h3 id="omniread.pdf.client--summary">Summary</h3>
<p>PDF client abstractions for OmniRead.</p>
<p>This module defines the <strong>client layer</strong> responsible for retrieving raw PDF
bytes from a concrete backing store.</p>
<p>Clients provide low-level access to PDF binaries and are intentionally
decoupled from scraping and parsing logic. They do not perform validation,
interpretation, or content extraction.</p>
<p>Typical backing stores include:
- Local filesystems
- Object storage (S3, GCS, etc.)
- Network file systems</p>
<p>Typical backing stores include:</p>
<ul>
<li>Local filesystems</li>
<li>Object storage (S3, GCS, etc.)</li>
<li>Network file systems</li>
</ul>
@@ -1014,14 +1015,20 @@ interpretation, or content extraction.</p>
Bases: <code><span title="abc.ABC">ABC</span></code></p>
<p>Abstract client responsible for retrieving PDF bytes
from a specific backing store (filesystem, S3, FTP, etc.).</p>
<p>Abstract client responsible for retrieving PDF bytes.</p>
<p>Retrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must accept a source identifier appropriate to
the backing store.
- Implementations must return the full PDF binary payload.
- Implementations must raise retrieval-specific errors on failure.
</code></pre></div></td></tr></table></div>
</details>
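<p>The client contract described in these notes can be sketched roughly as follows. This is an illustrative sketch, not OmniRead's actual source: the method name <code>fetch</code>, its signature, and the choice of <code>FileNotFoundError</code> as the retrieval error are assumptions made for the example.</p>

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class BasePDFClient(ABC):
    """Abstract client that returns the raw bytes of a PDF."""

    @abstractmethod
    def fetch(self, source: str) -> bytes:  # hypothetical method name
        """Return the full PDF payload, or raise on failure."""


class FileSystemPDFClient(BasePDFClient):
    """Reads PDF bytes from a local path (assumed behavior)."""

    def fetch(self, source: str) -> bytes:
        path = Path(source)
        if not path.is_file():
            # Retrieval-specific error for this backing store.
            raise FileNotFoundError(f"PDF not found: {path}")
        # No validation or parsing here: the client only returns bytes.
        return path.read_bytes()


def demo_roundtrip() -> bytes:
    """Write a small placeholder file and read it back through the client."""
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = Path(tmp) / "sample.pdf"
        pdf_path.write_bytes(b"%PDF-1.4 placeholder")
        return FileSystemPDFClient().fetch(str(pdf_path))
```

<p>Keeping the client this thin means a different backing store (object storage, a network share) would only need its own <code>fetch</code> implementation, under the same assumptions.</p>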
@@ -1165,7 +1172,9 @@ from a specific backing store (filesystem, S3, FTP, etc.).</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns their raw binary contents
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns
their raw binary contents.
</code></pre></div></td></tr></table></div>
</details>

View File

@@ -896,26 +896,26 @@
<div class="doc doc-contents first">
<p>PDF format implementation for OmniRead.</p>
<hr />
<h4 id="omniread.pdf--summary">Summary</h4>
<h3 id="omniread.pdf--summary">Summary</h3>
<p>PDF format implementation for OmniRead.</p>
<p>This package provides <strong>PDF-specific implementations</strong> of the core OmniRead
contracts defined in <code>omniread.core</code>.</p>
<p>Unlike HTML, PDF handling requires an explicit client layer for document
access. This package therefore includes:
- PDF clients for acquiring raw PDF data
- PDF scrapers that coordinate client access
- PDF parsers that extract structured content from PDF binaries</p>
access. This package therefore includes:</p>
<ul>
<li>PDF clients for acquiring raw PDF data.</li>
<li>PDF scrapers that coordinate client access.</li>
<li>PDF parsers that extract structured content from PDF binaries.</li>
</ul>
<p>Public exports from this package represent the supported PDF pipeline
and are safe for consumers to import directly when working with PDFs.</p>
<hr />
<h4 id="omniread.pdf--public-api">Public API</h4>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>FileSystemPDFClient
PDFScraper
PDFParser
</code></pre></div></td></tr></table></div>
<h3 id="omniread.pdf--public-api">Public API</h3>
<ul>
<li><code>FileSystemPDFClient</code></li>
<li><code>PDFScraper</code></li>
<li><code>PDFParser</code></li>
</ul>
<hr />
@@ -951,7 +951,9 @@ PDFParser
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns their raw binary contents
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns
their raw binary contents.
</code></pre></div></td></tr></table></div>
</details>
@@ -1093,7 +1095,7 @@ PDFParser
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
<p>Base PDF parser.</p>
@@ -1102,10 +1104,14 @@ PDFParser
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides
the extension point for implementing concrete PDF parsing strategies.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must define the output type `T` and
implement the `parse()` method.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the parser with content to be parsed.</p>
@@ -1125,7 +1131,7 @@ PDFParser
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -1268,7 +1274,9 @@ PDFParser
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and
return a deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -1339,7 +1347,7 @@ PDFParser
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="../omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="../core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
<p>Scraper for PDF sources.</p>
@@ -1349,11 +1357,15 @@ PDFParser
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output into Content
- Preserves caller-provided metadata
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output
into `Content`.
- Preserves caller-provided metadata.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not perform parsing or interpretation.
- The scraper does not assume a specific storage backend.
</code></pre></div></td></tr></table></div>
</details>
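<p>One way to picture the delegation described in these notes is the sketch below. It is a simplified stand-in rather than the real <code>PDFScraper</code>: the <code>Content</code> fields, the <code>SketchPDFScraper</code> name, and the <code>fetch</code> signature are all assumptions for illustration.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class Content:  # stand-in model; field names are assumed
    raw: bytes
    source: str
    content_type: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchPDFScraper:
    """Delegates byte retrieval and wraps the result in Content."""

    def __init__(self, fetch_bytes: Callable[[str], bytes]) -> None:
        # Any callable that maps a source to bytes works, so the
        # scraper itself stays independent of the storage backend.
        self._fetch_bytes = fetch_bytes

    def fetch(
        self, source: str, metadata: Optional[Dict[str, Any]] = None
    ) -> Content:
        raw = self._fetch_bytes(source)
        # No parsing or interpretation: bytes pass through untouched,
        # and caller-provided metadata is copied as-is.
        return Content(
            raw=raw,
            source=source,
            content_type="application/pdf",
            metadata=dict(metadata or {}),
        )
```

<p>For example, an in-memory dictionary can play the role of the client: <code>SketchPDFScraper({"a.pdf": b"%PDF-demo"}.__getitem__).fetch("a.pdf", {"owner": "demo"})</code>.</p>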
<p>Initialize the PDF scraper.</p>
@@ -1470,7 +1482,7 @@ PDFParser
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">

View File

@@ -962,9 +962,8 @@
<div class="doc doc-contents first">
<p>PDF parser base implementations for OmniRead.</p>
<hr />
<h4 id="omniread.pdf.parser--summary">Summary</h4>
<h3 id="omniread.pdf.parser--summary">Summary</h3>
<p>PDF parser base implementations for OmniRead.</p>
<p>This module defines the <strong>PDF-specific parser contract</strong>, extending the
format-agnostic <code>BaseParser</code> with constraints appropriate for PDF content.</p>
<p>PDF parsers are responsible for interpreting binary PDF data and producing
@@ -995,7 +994,7 @@ structured representations suitable for downstream consumption.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../../omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../../core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
<p>Base PDF parser.</p>
@@ -1004,10 +1003,14 @@ structured representations suitable for downstream consumption.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides
the extension point for implementing concrete PDF parsing strategies.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must define the output type `T` and
implement the `parse()` method.
</code></pre></div></td></tr></table></div>
</details>
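<p>The two constraints above (define <code>T</code>, implement <code>parse()</code>) might look like this in practice. The sketch uses stand-in <code>Content</code> and <code>ContentType</code> definitions with assumed fields, and the names <code>SketchPDFParser</code> and <code>ByteCountParser</code> are hypothetical; it is not the library's implementation.</p>

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):  # stand-in enum, not OmniRead's definition
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class Content:  # stand-in model; field names are assumed
    raw: bytes
    content_type: ContentType


class SketchPDFParser(ABC, Generic[T]):
    """Rejects non-PDF content and leaves parse() to subclasses."""

    def __init__(self, content: Content) -> None:
        # Enforce content-type compatibility up front.
        if content.content_type is not ContentType.PDF:
            raise ValueError("expected PDF content")
        self.content = content

    @abstractmethod
    def parse(self) -> T:
        """Return a deterministic, structured result."""


class ByteCountParser(SketchPDFParser[int]):
    """Toy subclass: T is int, and parse() returns the payload size."""

    def parse(self) -> int:
        return len(self.content.raw)
```

<p>Binding <code>T</code> in the subclass declaration lets a type checker infer the return type of <code>parse()</code> for each concrete parser.</p>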
<p>Initialize the parser with content to be parsed.</p>
@@ -1027,7 +1030,7 @@ structured representations suitable for downstream consumption.</p>
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -1170,7 +1173,9 @@ structured representations suitable for downstream consumption.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and
return a deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
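<p>The parser notes above can be sketched as follows. This is only an illustration under stated assumptions, not OmniRead's actual code: the <code>Content</code> dataclass, its field names, <code>SketchPDFParser</code>, and <code>PDFVersionParser</code> are hypothetical stand-ins, and reading the header version is a deliberately small substitute for fully interpreting the PDF payload.</p>

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Content:
    """Hypothetical stand-in for omniread.core.content.Content."""

    raw: bytes
    content_type: str
    metadata: dict[str, Any] = field(default_factory=dict)


class SketchPDFParser(ABC, Generic[T]):
    """Hypothetical base class: rejects non-PDF content, leaves parse() abstract."""

    def __init__(self, content: Content) -> None:
        # Enforce PDF content-type compatibility before any parsing happens.
        if content.content_type != "application/pdf":
            raise ValueError(f"unsupported content type: {content.content_type}")
        self.content = content

    @abstractmethod
    def parse(self) -> T:
        """Return a structured result of type T."""


class PDFVersionParser(SketchPDFParser[str]):
    """Concrete parser: fixes T to str and implements parse()."""

    def parse(self) -> str:
        # A PDF file starts with a header such as b"%PDF-1.7".
        header = self.content.raw[:8]
        if not header.startswith(b"%PDF-"):
            raise ValueError("payload does not start with a PDF header")
        # The same bytes always yield the same string, so output is deterministic.
        return header[5:8].decode("ascii")


pdf = Content(raw=b"%PDF-1.7\n...", content_type="application/pdf")
version = PDFVersionParser(pdf).parse()
```

<p>In this sketch the content-type check lives in the base class, so every concrete parser inherits it and only has to choose <code>T</code> and implement <code>parse()</code>.</p>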


@@ -894,9 +894,8 @@
<div class="doc doc-contents first">
<p>PDF scraping implementation for OmniRead.</p>
<hr />
<h4 id="omniread.pdf.scraper--summary">Summary</h4>
<h3 id="omniread.pdf.scraper--summary">Summary</h3>
<p>PDF scraping implementation for OmniRead.</p>
<p>This module provides a PDF-specific scraper that coordinates PDF byte
retrieval via a client and normalizes the result into a <code>Content</code> object.</p>
<p>The scraper implements the core <code>BaseScraper</code> contract while delegating
@@ -927,7 +926,7 @@ all storage and access concerns to a <code>BasePDFClient</code> implementation.<
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="../../omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="../../core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
<p>Scraper for PDF sources.</p>
@@ -937,11 +936,15 @@ all storage and access concerns to a <code>BasePDFClient</code> implementation.<
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output into Content
- Preserves caller-provided metadata
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output
into `Content`.
- Preserves caller-provided metadata.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not perform parsing or interpretation.
- The scraper does not assume a specific storage backend.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the PDF scraper.</p>
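<p>A minimal sketch of the scraper's responsibilities and constraints, under stated assumptions: <code>SketchPDFScraper</code>, <code>InMemoryPDFClient</code>, the <code>PDFClient</code> protocol, the <code>fetch()</code> signatures, and the <code>Content</code> field names are all hypothetical and are not taken from OmniRead's API.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Protocol


@dataclass(frozen=True)
class Content:
    """Hypothetical stand-in for omniread.core.content.Content."""

    raw: bytes
    content_type: str
    source: str
    metadata: dict[str, Any] = field(default_factory=dict)


class PDFClient(Protocol):
    """Hypothetical client contract: turn a source identifier into PDF bytes."""

    def fetch(self, source: str) -> bytes: ...


class InMemoryPDFClient:
    """Toy client backed by a dict; a real one might read from disk or a bucket."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(self, source: str) -> bytes:
        return self._documents[source]


class SketchPDFScraper:
    """Delegates byte retrieval to the client and wraps the result in Content."""

    def __init__(self, client: PDFClient) -> None:
        self._client = client

    def fetch(
        self, source: str, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        # Storage and access are entirely the client's concern.
        raw = self._client.fetch(source)
        # No parsing or interpretation: bytes pass through unchanged, and the
        # caller's metadata is copied into the result as-is.
        return Content(
            raw=raw,
            content_type="application/pdf",
            source=source,
            metadata=dict(metadata or {}),
        )


client = InMemoryPDFClient({"report.pdf": b"%PDF-1.4 example"})
content = SketchPDFScraper(client).fetch("report.pdf", metadata={"owner": "docs-team"})
```

<p>Because the scraper only depends on the client's <code>fetch()</code> method here, swapping the in-memory client for another backend would not require changing the scraper.</p>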
@@ -1058,7 +1061,7 @@ all storage and access concerns to a <code>BasePDFClient</code> implementation.<
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">

File diff suppressed because one or more lines are too long

Binary file not shown.