updated mcp

2026-03-08 17:57:34 +05:30
parent 9191de9dff
commit 0e49f02c4c
167 changed files with 7632 additions and 98942 deletions


@@ -298,7 +298,8 @@
</span>
</a>
</li>
<nav class="md-nav" aria-label="Installation">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#omniread--quick-start" class="md-nav__link">
@@ -307,6 +308,11 @@
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
@@ -316,6 +322,15 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#omniread--core-philosophy" class="md-nav__link">
<span class="md-ellipsis">
Core Philosophy
</span>
</a>
</li>
<li class="md-nav__item">
@@ -1237,7 +1252,8 @@
</span>
</a>
</li>
<nav class="md-nav" aria-label="Installation">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#omniread--quick-start" class="md-nav__link">
@@ -1246,6 +1262,11 @@
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
@@ -1255,6 +1276,15 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#omniread--core-philosophy" class="md-nav__link">
<span class="md-ellipsis">
Core Philosophy
</span>
</a>
</li>
<li class="md-nav__item">
@@ -1746,105 +1776,118 @@
<div class="doc doc-contents first">
<p>OmniRead — format-agnostic content acquisition and parsing framework.</p>
<hr />
<h4 id="omniread--summary">Summary</h4>
<p>OmniRead provides a <strong>cleanly layered architecture</strong> for fetching, parsing,
<h3 id="omniread--summary">Summary</h3>
<p><code>OmniRead</code> — format-agnostic content acquisition and parsing framework.</p>
<p><code>OmniRead</code> provides a <strong>cleanly layered architecture</strong> for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.</p>
<p>The library is structured around three core concepts:</p>
<ol>
<li><strong>Content</strong>: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.</li>
<li><strong>Scrapers</strong>: Components responsible for <em>acquiring</em> raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.</li>
<li><strong>Parsers</strong>: Components responsible for <em>interpreting</em> acquired content and converting it into structured, typed representations.</li>
<li><strong><code>Content</code></strong>: A canonical, format-agnostic container representing raw content
bytes and minimal contextual metadata.</li>
<li><strong><code>Scrapers</code></strong>: Components responsible for <em>acquiring</em> raw content from a
source (HTTP, filesystem, object storage, etc.). <code>Scrapers</code> never interpret
content.</li>
<li><strong><code>Parsers</code></strong>: Components responsible for <em>interpreting</em> acquired content and
converting it into structured, typed representations.</li>
</ol>
<p>OmniRead deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior</p>
<hr />
<h4 id="omniread--installation">Installation</h4>
<p>Install OmniRead using pip:</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>pip install omniread
</code></pre></div></td></tr></table></div>
<p>Or with Poetry:</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>poetry add omniread
</code></pre></div></td></tr></table></div>
<p><code>OmniRead</code> deliberately separates these responsibilities to ensure:</p>
<ul>
<li>Clear boundaries between IO and interpretation.</li>
<li>Replaceable implementations per format.</li>
<li>Predictable, testable behavior.</li>
</ul>
<h3 id="omniread--installation">Installation</h3>
<p>Install <code>OmniRead</code> using pip:</p>
<div class="language-bash highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-0-1">1</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-0-1"><a id="__codelineno-0-1" name="__codelineno-0-1"></a>pip<span class="w"> </span>install<span class="w"> </span>omniread
</span></code></pre></div></td></tr></table></div>
<p>Install <code>OmniRead</code> using Poetry:
<div class="language-bash highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-1-1">1</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-1-1"><a id="__codelineno-1-1" name="__codelineno-1-1"></a>poetry<span class="w"> </span>add<span class="w"> </span>omniread
</span></code></pre></div></td></tr></table></div></p>
<hr />
<h4 id="omniread--quick-start">Quick start</h4>
<p>HTML example:</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span></pre></div></td><td class="code"><div><pre><span></span><code>from omniread import HTMLScraper, HTMLParser
scraper = HTMLScraper()
content = scraper.fetch(&quot;https://example.com&quot;)
class TitleParser(HTMLParser[str]):
def parse(self) -&gt; str:
return self._soup.title.string
parser = TitleParser(content)
title = parser.parse()
</code></pre></div></td></tr></table></div>
<p>PDF example:</p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span></pre></div></td><td class="code"><div><pre><span></span><code>from omniread import FileSystemPDFClient, PDFScraper, PDFParser
from pathlib import Path
client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path(&quot;document.pdf&quot;))
class TextPDFParser(PDFParser[str]):
def parse(self) -&gt; str:
# implement PDF text extraction
...
parser = TextPDFParser(content)
result = parser.parse()
</code></pre></div></td></tr></table></div>
<hr />
<h4 id="omniread--public-api">Public API</h4>
<details class="example" open>
<summary>Example</summary>
<p>HTML example:
<div class="language-python highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-0-1"> 1</a></span>
<span class="normal"><a href="#__codelineno-0-2"> 2</a></span>
<span class="normal"><a href="#__codelineno-0-3"> 3</a></span>
<span class="normal"><a href="#__codelineno-0-4"> 4</a></span>
<span class="normal"><a href="#__codelineno-0-5"> 5</a></span>
<span class="normal"><a href="#__codelineno-0-6"> 6</a></span>
<span class="normal"><a href="#__codelineno-0-7"> 7</a></span>
<span class="normal"><a href="#__codelineno-0-8"> 8</a></span>
<span class="normal"><a href="#__codelineno-0-9"> 9</a></span>
<span class="normal"><a href="#__codelineno-0-10">10</a></span>
<span class="normal"><a href="#__codelineno-0-11">11</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-0-1"><a id="__codelineno-0-1" name="__codelineno-0-1"></a><span class="kn">from</span><span class="w"> </span><span class="nn">omniread</span><span class="w"> </span><span class="kn">import</span> <span class="n">HTMLScraper</span><span class="p">,</span> <span class="n">HTMLParser</span>
</span><span id="__span-0-2"><a id="__codelineno-0-2" name="__codelineno-0-2"></a>
</span><span id="__span-0-3"><a id="__codelineno-0-3" name="__codelineno-0-3"></a><span class="n">scraper</span> <span class="o">=</span> <span class="n">HTMLScraper</span><span class="p">()</span>
</span><span id="__span-0-4"><a id="__codelineno-0-4" name="__codelineno-0-4"></a><span class="n">content</span> <span class="o">=</span> <span class="n">scraper</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="s2">&quot;https://example.com&quot;</span><span class="p">)</span>
</span><span id="__span-0-5"><a id="__codelineno-0-5" name="__codelineno-0-5"></a>
</span><span id="__span-0-6"><a id="__codelineno-0-6" name="__codelineno-0-6"></a><span class="k">class</span><span class="w"> </span><span class="nc">TitleParser</span><span class="p">(</span><span class="n">HTMLParser</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
</span><span id="__span-0-7"><a id="__codelineno-0-7" name="__codelineno-0-7"></a> <span class="k">def</span><span class="w"> </span><span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span><span id="__span-0-8"><a id="__codelineno-0-8" name="__codelineno-0-8"></a> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">string</span>
</span><span id="__span-0-9"><a id="__codelineno-0-9" name="__codelineno-0-9"></a>
</span><span id="__span-0-10"><a id="__codelineno-0-10" name="__codelineno-0-10"></a><span class="n">parser</span> <span class="o">=</span> <span class="n">TitleParser</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</span><span id="__span-0-11"><a id="__codelineno-0-11" name="__codelineno-0-11"></a><span class="n">title</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">()</span>
</span></code></pre></div></td></tr></table></div></p>
<p>PDF example:
<div class="language-python highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-1-1"> 1</a></span>
<span class="normal"><a href="#__codelineno-1-2"> 2</a></span>
<span class="normal"><a href="#__codelineno-1-3"> 3</a></span>
<span class="normal"><a href="#__codelineno-1-4"> 4</a></span>
<span class="normal"><a href="#__codelineno-1-5"> 5</a></span>
<span class="normal"><a href="#__codelineno-1-6"> 6</a></span>
<span class="normal"><a href="#__codelineno-1-7"> 7</a></span>
<span class="normal"><a href="#__codelineno-1-8"> 8</a></span>
<span class="normal"><a href="#__codelineno-1-9"> 9</a></span>
<span class="normal"><a href="#__codelineno-1-10">10</a></span>
<span class="normal"><a href="#__codelineno-1-11">11</a></span>
<span class="normal"><a href="#__codelineno-1-12">12</a></span>
<span class="normal"><a href="#__codelineno-1-13">13</a></span>
<span class="normal"><a href="#__codelineno-1-14">14</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-1-1"><a id="__codelineno-1-1" name="__codelineno-1-1"></a><span class="kn">from</span><span class="w"> </span><span class="nn">omniread</span><span class="w"> </span><span class="kn">import</span> <span class="n">FileSystemPDFClient</span><span class="p">,</span> <span class="n">PDFScraper</span><span class="p">,</span> <span class="n">PDFParser</span>
</span><span id="__span-1-2"><a id="__codelineno-1-2" name="__codelineno-1-2"></a><span class="kn">from</span><span class="w"> </span><span class="nn">pathlib</span><span class="w"> </span><span class="kn">import</span> <span class="n">Path</span>
</span><span id="__span-1-3"><a id="__codelineno-1-3" name="__codelineno-1-3"></a>
</span><span id="__span-1-4"><a id="__codelineno-1-4" name="__codelineno-1-4"></a><span class="n">client</span> <span class="o">=</span> <span class="n">FileSystemPDFClient</span><span class="p">()</span>
</span><span id="__span-1-5"><a id="__codelineno-1-5" name="__codelineno-1-5"></a><span class="n">scraper</span> <span class="o">=</span> <span class="n">PDFScraper</span><span class="p">(</span><span class="n">client</span><span class="o">=</span><span class="n">client</span><span class="p">)</span>
</span><span id="__span-1-6"><a id="__codelineno-1-6" name="__codelineno-1-6"></a><span class="n">content</span> <span class="o">=</span> <span class="n">scraper</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">&quot;document.pdf&quot;</span><span class="p">))</span>
</span><span id="__span-1-7"><a id="__codelineno-1-7" name="__codelineno-1-7"></a>
</span><span id="__span-1-8"><a id="__codelineno-1-8" name="__codelineno-1-8"></a><span class="k">class</span><span class="w"> </span><span class="nc">TextPDFParser</span><span class="p">(</span><span class="n">PDFParser</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
</span><span id="__span-1-9"><a id="__codelineno-1-9" name="__codelineno-1-9"></a> <span class="k">def</span><span class="w"> </span><span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span><span id="__span-1-10"><a id="__codelineno-1-10" name="__codelineno-1-10"></a> <span class="c1"># implement PDF text extraction</span>
</span><span id="__span-1-11"><a id="__codelineno-1-11" name="__codelineno-1-11"></a> <span class="o">...</span>
</span><span id="__span-1-12"><a id="__codelineno-1-12" name="__codelineno-1-12"></a>
</span><span id="__span-1-13"><a id="__codelineno-1-13" name="__codelineno-1-13"></a><span class="n">parser</span> <span class="o">=</span> <span class="n">TextPDFParser</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</span><span id="__span-1-14"><a id="__codelineno-1-14" name="__codelineno-1-14"></a><span class="n">result</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">()</span>
</span></code></pre></div></td></tr></table></div></p>
</details> <hr />
<h3 id="omniread--public-api">Public API</h3>
<p>This module re-exports the <strong>recommended public entry points</strong> of <code>OmniRead</code>.
Consumers are encouraged to import from this namespace rather than from
format-specific submodules directly, unless advanced customization is
required.</p>
<p><strong>Core:</strong>
- Content
- ContentType</p>
<p><strong>HTML:</strong>
- HTMLScraper
- HTMLParser</p>
<p><strong>PDF:</strong>
- FileSystemPDFClient
- PDFScraper
- PDFParser</p>
<p><strong>Core Philosophy:</strong>
<code>OmniRead</code> is designed as a <strong>decoupled content engine</strong>:
1. <strong>Separation of Concerns</strong>: Scrapers <em>fetch</em>, Parsers <em>interpret</em>. Neither knows about the other.
2. <strong>Normalized Exchange</strong>: All components communicate via the <code>Content</code> model, ensuring a consistent contract.
3. <strong>Format Agnosticism</strong>: The core logic is independent of whether the input is HTML, PDF, or JSON.</p>
<ul>
<li><code>Content</code>: Canonical content model.</li>
<li><code>ContentType</code>: Supported media types.</li>
<li><code>HTMLScraper</code>: HTTP-based HTML acquisition.</li>
<li><code>HTMLParser</code>: Base parser for HTML DOM interpretation.</li>
<li><code>FileSystemPDFClient</code>: Local filesystem PDF access.</li>
<li><code>PDFScraper</code>: PDF-specific content acquisition.</li>
<li><code>PDFParser</code>: Base parser for PDF binary interpretation.</li>
</ul>
<hr />
<h3 id="omniread--core-philosophy">Core Philosophy</h3>
<p><code>OmniRead</code> is designed as a <strong>decoupled content engine</strong>:</p>
<ol>
<li><strong>Separation of Concerns</strong>: Scrapers <em>fetch</em>, Parsers <em>interpret</em>. Neither
knows about the other.</li>
<li><strong>Normalized Exchange</strong>: All components communicate via the <code>Content</code> model,
ensuring a consistent contract.</li>
<li><strong>Format Agnosticism</strong>: The core logic is independent of whether the input
is HTML, PDF, or JSON.</li>
</ol>
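<p>The three principles above can be illustrated with a minimal sketch. It does not use the real <code>OmniRead</code> classes: <code>SketchContent</code>, <code>SketchScraper</code>, and <code>SketchWordCountParser</code> are hypothetical stand-ins, and the in-memory store replaces any real IO. The only point is that the scraper and parser share nothing except the content container.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for the Content model: raw bytes plus metadata."""

    raw: bytes
    source: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchScraper:
    """Acquires bytes and never interprets them."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        # An in-memory mapping stands in for HTTP, disk, or object storage.
        self._store = store

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(raw=self._store[source], source=source)


class SketchWordCountParser:
    """Interprets bytes and never performs IO."""

    def __init__(self, content: SketchContent) -> None:
        self._content = content

    def parse(self) -> int:
        # Decode the payload and count whitespace-separated tokens.
        return len(self._content.raw.decode("utf-8").split())


scraper = SketchScraper({"memory://greeting": b"hello decoupled world"})
content = scraper.fetch("memory://greeting")
word_count = SketchWordCountParser(content).parse()
```

<p>Because the parser only sees <code>SketchContent</code>, the scraper could be swapped for one backed by a different source without touching the parsing code.</p>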
<hr />
@@ -1884,8 +1927,12 @@ required.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type
- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with
minimal contextual metadata describing its origin and type.
- This class is the primary exchange format between scrapers,
parsers, and downstream consumers.
</code></pre></div></td></tr></table></div>
</details>
@@ -2026,8 +2073,12 @@ required.</p>
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the content source
- It is primarily used for routing content to the appropriate parser or downstream consumer
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the
content source.
- It is primarily used for routing content to the appropriate
parser or downstream consumer.
</code></pre></div></td></tr></table></div>
</details>
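<p>The routing role described in the notes above could look roughly like the following sketch. <code>SketchContentType</code>, its members, and the handler functions are assumptions made for illustration; the members of the real <code>ContentType</code> enum are not shown in this diff and may differ.</p>

```python
from enum import Enum
from typing import Callable, Dict


class SketchContentType(str, Enum):
    """Hypothetical media types; the real enum's members may differ."""

    HTML = "text/html"
    PDF = "application/pdf"


def describe_html(raw: bytes) -> str:
    return f"html:{len(raw)} bytes"


def describe_pdf(raw: bytes) -> str:
    return f"pdf:{len(raw)} bytes"


# Registry mapping each media type to the consumer that understands it.
HANDLERS: Dict[SketchContentType, Callable[[bytes], str]] = {
    SketchContentType.HTML: describe_html,
    SketchContentType.PDF: describe_pdf,
}


def route(content_type: SketchContentType, raw: bytes) -> str:
    """Dispatch a payload to the handler registered for its media type."""
    try:
        handler = HANDLERS[content_type]
    except KeyError:
        raise ValueError(f"no handler registered for {content_type!r}") from None
    return handler(raw)


routed = route(SketchContentType.HTML, b"<html></html>")
```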
@@ -2169,7 +2220,9 @@ required.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns their raw binary contents
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns
their raw binary contents.
</code></pre></div></td></tr></table></div>
</details>
@@ -2311,7 +2364,7 @@ required.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
<p>Base HTML parser.</p>
@@ -2321,14 +2374,24 @@ required.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers
- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
- Provides reusable helpers for HTML extraction. Concrete parsers must
explicitly define the return type.
</code></pre></div></td></tr></table></div>
<p><strong>Guarantees:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Accepts only HTML content.
- Owns a parsed BeautifulSoup DOM tree.
- Provides pure helper utilities for common HTML structures.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement
the `parse()` method.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the HTML parser.</p>
@@ -2348,7 +2411,7 @@ required.</p>
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -2482,7 +2545,9 @@ required.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a
deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -2698,8 +2763,10 @@ Dictionary containing extracted metadata.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document
- This includes: Document title, `&lt;meta&gt;` tag name/property → content mappings
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document.
- This includes the document title and `&lt;meta&gt;` tag name/property to
content mappings.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -2846,21 +2913,29 @@ A list of rows, where each row is a list of cell text values.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
<p>Base HTML scraper using httpx.</p>
<p>Base HTML scraper using <code>httpx</code>.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object
- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns
them as raw content wrapped in a `Content` object.
- Fetches raw bytes and metadata only.
- The scraper uses `httpx.Client` for HTTP requests, enforces an
HTML content type, and preserves HTTP response metadata.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not parse HTML, perform retries or backoff, or
handle non-HTML responses.
</code></pre></div></td></tr></table></div>
</details>
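<p>Since the constraints above state that the scraper performs no retries or backoff, callers who need them would have to add them on their side. A minimal sketch of one way to do that follows; <code>fetch_with_retries</code> and all of its parameters are hypothetical helpers, not part of <code>OmniRead</code>, and the wrapper works with any zero-argument callable.</p>

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` up to `attempts` times with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            # Re-raise the original error once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

<p>Under these assumptions, a call might be written as <code>fetch_with_retries(lambda: scraper.fetch("https://example.com"))</code>, ideally with <code>retry_on</code> narrowed to the transport errors the caller actually wants to retry.</p>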
<p>Initialize the HTML scraper.</p>
@@ -3019,7 +3094,7 @@ A list of rows, where each row is a list of cell text values.</p>
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -3160,7 +3235,7 @@ A list of rows, where each row is a list of cell text values.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
<p>Base PDF parser.</p>
@@ -3169,10 +3244,14 @@ A list of rows, where each row is a list of cell text values.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides
the extension point for implementing concrete PDF parsing strategies.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must define the output type `T` and
implement the `parse()` method.
</code></pre></div></td></tr></table></div>
</details>
<p>Initialize the parser with content to be parsed.</p>
@@ -3192,7 +3271,7 @@ A list of rows, where each row is a list of cell text values.</p>
<tr class="doc-section-item">
<td><code>content</code></td>
<td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -3335,7 +3414,9 @@ A list of rows, where each row is a list of cell text values.</p>
<details class="notes" open>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and
return a deterministic, structured output.
</code></pre></div></td></tr></table></div>
</details>
</div>
@@ -3406,7 +3487,7 @@ A list of rows, where each row is a list of cell text values.</p>
<div class="doc doc-contents ">
<p class="doc doc-class-bases">
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
<p>Scraper for PDF sources.</p>
@@ -3416,11 +3497,15 @@ A list of rows, where each row is a list of cell text values.</p>
<summary>Notes</summary>
<p><strong>Responsibilities:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output into Content
- Preserves caller-provided metadata
<span class="normal">2</span>
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output
into `Content`.
- Preserves caller-provided metadata.
</code></pre></div></td></tr></table></div>
<p><strong>Constraints:</strong></p>
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not perform parsing or interpretation.
- The scraper does not assume a specific storage backend.
</code></pre></div></td></tr></table></div>
</details>
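<p>The delegation described in the notes above can be sketched as follows. The client contract is not shown in this diff, so the single <code>fetch(source) -&gt; bytes</code> method on <code>SketchPDFClient</code> is an assumption, and <code>InMemoryPDFClient</code> and <code>SketchPDFScraper</code> are hypothetical stand-ins rather than the library's classes.</p>

```python
from typing import Any, Dict, Optional, Protocol


class SketchPDFClient(Protocol):
    """Assumed client contract: return the raw bytes for a source."""

    def fetch(self, source: Any) -> bytes:
        ...


class InMemoryPDFClient:
    """Hypothetical backend that serves PDF bytes from a dictionary."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self._blobs = dict(blobs)

    def fetch(self, source: Any) -> bytes:
        return self._blobs[str(source)]


class SketchPDFScraper:
    """Delegates byte retrieval to the client and wraps the result."""

    def __init__(self, client: SketchPDFClient) -> None:
        self._client = client

    def fetch(
        self, source: Any, metadata: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        raw = self._client.fetch(source)
        # Caller-provided metadata is copied through unchanged.
        return {"raw": raw, "source": str(source), "metadata": dict(metadata or {})}


pdf_scraper = SketchPDFScraper(
    client=InMemoryPDFClient({"report.pdf": b"%PDF-1.7 stub"})
)
pdf_result = pdf_scraper.fetch("report.pdf", metadata={"team": "docs"})
```

<p>The scraper never inspects where the bytes came from, so a filesystem, HTTP, or object-storage client could be substituted as long as it honors the same assumed contract.</p>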
<p>Initialize the PDF scraper.</p>
@@ -3537,7 +3622,7 @@ A list of rows, where each row is a list of cell text values.</p>
<tbody>
<tr class="doc-section-item">
<td><code>Content</code></td> <td>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
</td>
<td>
<div class="doc-md-description">
@@ -3590,9 +3675,7 @@ A list of rows, where each row is a list of cell text values.</p>
</div>
</div><ul>
<li><a href="omniread/">Omniread</a></li>
</ul>
</div>