@@ -298,7 +298,8 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
<nav class="md-nav" aria-label="Installation">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#omniread--quick-start" class="md-nav__link">
|
||||
@@ -307,6 +308,11 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -316,6 +322,15 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#omniread--core-philosophy" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
Core Philosophy
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1237,7 +1252,8 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
<nav class="md-nav" aria-label="Installation">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#omniread--quick-start" class="md-nav__link">
|
||||
@@ -1246,6 +1262,11 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1255,6 +1276,15 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#omniread--core-philosophy" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
Core Philosophy
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1746,105 +1776,118 @@
|
||||
|
||||
<div class="doc doc-contents first">
|
||||
|
||||
<p>OmniRead — format-agnostic content acquisition and parsing framework.</p>
|
||||
<hr />
|
||||
<h4 id="omniread--summary">Summary</h4>
|
||||
<p>OmniRead provides a <strong>cleanly layered architecture</strong> for fetching, parsing,
|
||||
<h3 id="omniread--summary">Summary</h3>
|
||||
<p><code>OmniRead</code> is a format-agnostic content acquisition and parsing framework.</p>
|
||||
<p><code>OmniRead</code> provides a <strong>cleanly layered architecture</strong> for fetching, parsing,
|
||||
and normalizing content from heterogeneous sources such as HTML documents
|
||||
and PDF files.</p>
|
||||
<p>The library is structured around three core concepts:</p>
|
||||
<ol>
|
||||
<li><strong>Content</strong>: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.</li>
|
||||
<li><strong>Scrapers</strong>: Components responsible for <em>acquiring</em> raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.</li>
|
||||
<li><strong>Parsers</strong>: Components responsible for <em>interpreting</em> acquired content and converting it into structured, typed representations.</li>
|
||||
<li><strong><code>Content</code></strong>: A canonical, format-agnostic container representing raw content
|
||||
bytes and minimal contextual metadata.</li>
|
||||
<li><strong><code>Scrapers</code></strong>: Components responsible for <em>acquiring</em> raw content from a
|
||||
source (HTTP, filesystem, object storage, etc.). <code>Scrapers</code> never interpret
|
||||
content.</li>
|
||||
<li><strong><code>Parsers</code></strong>: Components responsible for <em>interpreting</em> acquired content and
|
||||
converting it into structured, typed representations.</li>
|
||||
</ol>
|
||||
<p>OmniRead deliberately separates these responsibilities to ensure:</p>
<ul>
<li>Clear boundaries between IO and interpretation</li>
<li>Replaceable implementations per format</li>
<li>Predictable, testable behavior</li>
</ul>
|
||||
<hr />
|
||||
<h4 id="omniread--installation">Installation</h4>
|
||||
<p>Install OmniRead using pip:</p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>pip install omniread
|
||||
</code></pre></div></td></tr></table></div>
|
||||
<p>Or with Poetry:</p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>poetry add omniread
|
||||
</code></pre></div></td></tr></table></div>
|
||||
<p><code>OmniRead</code> deliberately separates these responsibilities to ensure:</p>
|
||||
<ul>
|
||||
<li>Clear boundaries between IO and interpretation.</li>
|
||||
<li>Replaceable implementations per format.</li>
|
||||
<li>Predictable, testable behavior.</li>
|
||||
</ul>
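<p>The separation described above can be sketched with standard-library Python only. Every name below (<code>SketchContent</code>, <code>InMemoryScraper</code>, <code>WordCountParser</code>) is a hypothetical stand-in chosen for illustration and is not part of the OmniRead API; the sketch only assumes that a scraper returns a content container and a parser consumes one.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for `Content`: raw bytes plus minimal metadata."""
    raw: bytes
    source: str
    content_type: str
    metadata: dict[str, Any] = field(default_factory=dict)


class SketchScraper(Protocol):
    """Acquires bytes from a source; never interprets them."""
    def fetch(self, source: str) -> SketchContent: ...


class InMemoryScraper:
    """Toy scraper that 'fetches' from a dict instead of HTTP or disk."""
    def __init__(self, store: dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(
            raw=self._store[source], source=source, content_type="text/plain"
        )


class WordCountParser:
    """Toy parser: interprets the bytes as UTF-8 text and counts words."""
    def __init__(self, content: SketchContent) -> None:
        self._content = content

    def parse(self) -> int:
        return len(self._content.raw.decode("utf-8").split())


scraper = InMemoryScraper({"memo": b"fetch first, interpret later"})
content = scraper.fetch("memo")
word_count = WordCountParser(content).parse()
```

<p>Because the parser only sees the container, the in-memory scraper could be swapped for an HTTP or filesystem one without touching <code>WordCountParser</code>.</p>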
|
||||
<h3 id="omniread--installation">Installation</h3>
|
||||
<p>Install <code>OmniRead</code> using pip:</p>
|
||||
<div class="language-bash highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-0-1">1</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-0-1"><a id="__codelineno-0-1" name="__codelineno-0-1"></a>pip<span class="w"> </span>install<span class="w"> </span>omniread
|
||||
</span></code></pre></div></td></tr></table></div>
|
||||
<p>Install <code>OmniRead</code> using Poetry:</p>
<div class="language-bash highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-1-1">1</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-1-1"><a id="__codelineno-1-1" name="__codelineno-1-1"></a>poetry<span class="w"> </span>add<span class="w"> </span>omniread
</span></code></pre></div></td></tr></table></div>
|
||||
<hr />
|
||||
<h4 id="omniread--quick-start">Quick start</h4>
|
||||
<p>HTML example:</p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"> 1</span>
|
||||
<span class="normal"> 2</span>
|
||||
<span class="normal"> 3</span>
|
||||
<span class="normal"> 4</span>
|
||||
<span class="normal"> 5</span>
|
||||
<span class="normal"> 6</span>
|
||||
<span class="normal"> 7</span>
|
||||
<span class="normal"> 8</span>
|
||||
<span class="normal"> 9</span>
|
||||
<span class="normal">10</span>
|
||||
<span class="normal">11</span></pre></div></td><td class="code"><div><pre><span></span><code>from omniread import HTMLScraper, HTMLParser

scraper = HTMLScraper()
content = scraper.fetch("https://example.com")

class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        return self._soup.title.string

parser = TitleParser(content)
title = parser.parse()
</code></pre></div></td></tr></table></div>
|
||||
<p>PDF example:</p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"> 1</span>
|
||||
<span class="normal"> 2</span>
|
||||
<span class="normal"> 3</span>
|
||||
<span class="normal"> 4</span>
|
||||
<span class="normal"> 5</span>
|
||||
<span class="normal"> 6</span>
|
||||
<span class="normal"> 7</span>
|
||||
<span class="normal"> 8</span>
|
||||
<span class="normal"> 9</span>
|
||||
<span class="normal">10</span>
|
||||
<span class="normal">11</span>
|
||||
<span class="normal">12</span>
|
||||
<span class="normal">13</span>
|
||||
<span class="normal">14</span></pre></div></td><td class="code"><div><pre><span></span><code>from omniread import FileSystemPDFClient, PDFScraper, PDFParser
from pathlib import Path

client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))

class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...

parser = TextPDFParser(content)
result = parser.parse()
</code></pre></div></td></tr></table></div>
|
||||
<hr />
|
||||
<h4 id="omniread--public-api">Public API</h4>
|
||||
<details class="example" open>
|
||||
<summary>Example</summary>
|
||||
<p>HTML example:
|
||||
<div class="language-python highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-0-1"> 1</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-2"> 2</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-3"> 3</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-4"> 4</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-5"> 5</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-6"> 6</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-7"> 7</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-8"> 8</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-9"> 9</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-10">10</a></span>
|
||||
<span class="normal"><a href="#__codelineno-0-11">11</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-0-1"><a id="__codelineno-0-1" name="__codelineno-0-1"></a><span class="kn">from</span><span class="w"> </span><span class="nn">omniread</span><span class="w"> </span><span class="kn">import</span> <span class="n">HTMLScraper</span><span class="p">,</span> <span class="n">HTMLParser</span>
|
||||
</span><span id="__span-0-2"><a id="__codelineno-0-2" name="__codelineno-0-2"></a>
|
||||
</span><span id="__span-0-3"><a id="__codelineno-0-3" name="__codelineno-0-3"></a><span class="n">scraper</span> <span class="o">=</span> <span class="n">HTMLScraper</span><span class="p">()</span>
|
||||
</span><span id="__span-0-4"><a id="__codelineno-0-4" name="__codelineno-0-4"></a><span class="n">content</span> <span class="o">=</span> <span class="n">scraper</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="s2">"https://example.com"</span><span class="p">)</span>
|
||||
</span><span id="__span-0-5"><a id="__codelineno-0-5" name="__codelineno-0-5"></a>
|
||||
</span><span id="__span-0-6"><a id="__codelineno-0-6" name="__codelineno-0-6"></a><span class="k">class</span><span class="w"> </span><span class="nc">TitleParser</span><span class="p">(</span><span class="n">HTMLParser</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
|
||||
</span><span id="__span-0-7"><a id="__codelineno-0-7" name="__codelineno-0-7"></a>    <span class="k">def</span><span class="w"> </span><span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
</span><span id="__span-0-8"><a id="__codelineno-0-8" name="__codelineno-0-8"></a>        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">string</span>
|
||||
</span><span id="__span-0-9"><a id="__codelineno-0-9" name="__codelineno-0-9"></a>
|
||||
</span><span id="__span-0-10"><a id="__codelineno-0-10" name="__codelineno-0-10"></a><span class="n">parser</span> <span class="o">=</span> <span class="n">TitleParser</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
|
||||
</span><span id="__span-0-11"><a id="__codelineno-0-11" name="__codelineno-0-11"></a><span class="n">title</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">()</span>
|
||||
</span></code></pre></div></td></tr></table></div></p>
|
||||
<p>PDF example:
|
||||
<div class="language-python highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal"><a href="#__codelineno-1-1"> 1</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-2"> 2</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-3"> 3</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-4"> 4</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-5"> 5</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-6"> 6</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-7"> 7</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-8"> 8</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-9"> 9</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-10">10</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-11">11</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-12">12</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-13">13</a></span>
|
||||
<span class="normal"><a href="#__codelineno-1-14">14</a></span></pre></div></td><td class="code"><div><pre><span></span><code><span id="__span-1-1"><a id="__codelineno-1-1" name="__codelineno-1-1"></a><span class="kn">from</span><span class="w"> </span><span class="nn">omniread</span><span class="w"> </span><span class="kn">import</span> <span class="n">FileSystemPDFClient</span><span class="p">,</span> <span class="n">PDFScraper</span><span class="p">,</span> <span class="n">PDFParser</span>
|
||||
</span><span id="__span-1-2"><a id="__codelineno-1-2" name="__codelineno-1-2"></a><span class="kn">from</span><span class="w"> </span><span class="nn">pathlib</span><span class="w"> </span><span class="kn">import</span> <span class="n">Path</span>
|
||||
</span><span id="__span-1-3"><a id="__codelineno-1-3" name="__codelineno-1-3"></a>
|
||||
</span><span id="__span-1-4"><a id="__codelineno-1-4" name="__codelineno-1-4"></a><span class="n">client</span> <span class="o">=</span> <span class="n">FileSystemPDFClient</span><span class="p">()</span>
|
||||
</span><span id="__span-1-5"><a id="__codelineno-1-5" name="__codelineno-1-5"></a><span class="n">scraper</span> <span class="o">=</span> <span class="n">PDFScraper</span><span class="p">(</span><span class="n">client</span><span class="o">=</span><span class="n">client</span><span class="p">)</span>
|
||||
</span><span id="__span-1-6"><a id="__codelineno-1-6" name="__codelineno-1-6"></a><span class="n">content</span> <span class="o">=</span> <span class="n">scraper</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"document.pdf"</span><span class="p">))</span>
|
||||
</span><span id="__span-1-7"><a id="__codelineno-1-7" name="__codelineno-1-7"></a>
|
||||
</span><span id="__span-1-8"><a id="__codelineno-1-8" name="__codelineno-1-8"></a><span class="k">class</span><span class="w"> </span><span class="nc">TextPDFParser</span><span class="p">(</span><span class="n">PDFParser</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
|
||||
</span><span id="__span-1-9"><a id="__codelineno-1-9" name="__codelineno-1-9"></a>    <span class="k">def</span><span class="w"> </span><span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
</span><span id="__span-1-10"><a id="__codelineno-1-10" name="__codelineno-1-10"></a>        <span class="c1"># implement PDF text extraction</span>
</span><span id="__span-1-11"><a id="__codelineno-1-11" name="__codelineno-1-11"></a>        <span class="o">...</span>
|
||||
</span><span id="__span-1-12"><a id="__codelineno-1-12" name="__codelineno-1-12"></a>
|
||||
</span><span id="__span-1-13"><a id="__codelineno-1-13" name="__codelineno-1-13"></a><span class="n">parser</span> <span class="o">=</span> <span class="n">TextPDFParser</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
|
||||
</span><span id="__span-1-14"><a id="__codelineno-1-14" name="__codelineno-1-14"></a><span class="n">result</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">()</span>
|
||||
</span></code></pre></div></td></tr></table></div></p>
|
||||
</details> <hr />
|
||||
<h3 id="omniread--public-api">Public API</h3>
|
||||
<p>This module re-exports the <strong>recommended public entry points</strong> of OmniRead.
|
||||
Consumers are encouraged to import from this namespace rather than from
|
||||
format-specific submodules directly, unless advanced customization is
|
||||
required.</p>
|
||||
<p><strong>Core:</strong></p>
<ul>
<li>Content</li>
<li>ContentType</li>
</ul>
<p><strong>HTML:</strong></p>
<ul>
<li>HTMLScraper</li>
<li>HTMLParser</li>
</ul>
<p><strong>PDF:</strong></p>
<ul>
<li>FileSystemPDFClient</li>
<li>PDFScraper</li>
<li>PDFParser</li>
</ul>
|
||||
<p><strong>Core Philosophy:</strong></p>
<p><code>OmniRead</code> is designed as a <strong>decoupled content engine</strong>:</p>
<ol>
<li><strong>Separation of Concerns</strong>: Scrapers <em>fetch</em>, Parsers <em>interpret</em>. Neither knows about the other.</li>
<li><strong>Normalized Exchange</strong>: All components communicate via the <code>Content</code> model, ensuring a consistent contract.</li>
<li><strong>Format Agnosticism</strong>: The core logic is independent of whether the input is HTML, PDF, or JSON.</li>
</ol>
|
||||
<ul>
|
||||
<li><code>Content</code>: Canonical content model.</li>
|
||||
<li><code>ContentType</code>: Supported media types.</li>
|
||||
<li><code>HTMLScraper</code>: HTTP-based HTML acquisition.</li>
|
||||
<li><code>HTMLParser</code>: Base parser for HTML DOM interpretation.</li>
|
||||
<li><code>FileSystemPDFClient</code>: Local filesystem PDF access.</li>
|
||||
<li><code>PDFScraper</code>: PDF-specific content acquisition.</li>
|
||||
<li><code>PDFParser</code>: Base parser for PDF binary interpretation.</li>
|
||||
</ul>
|
||||
<hr />
|
||||
<h3 id="omniread--core-philosophy">Core Philosophy</h3>
|
||||
<p><code>OmniRead</code> is designed as a <strong>decoupled content engine</strong>:</p>
|
||||
<ol>
|
||||
<li><strong>Separation of Concerns</strong>: Scrapers <em>fetch</em>, Parsers <em>interpret</em>. Neither
|
||||
knows about the other.</li>
|
||||
<li><strong>Normalized Exchange</strong>: All components communicate via the <code>Content</code> model,
|
||||
ensuring a consistent contract.</li>
|
||||
<li><strong>Format Agnosticism</strong>: The core logic is independent of whether the input
|
||||
is HTML, PDF, or JSON.</li>
|
||||
</ol>
|
||||
<hr />
|
||||
|
||||
|
||||
@@ -1884,8 +1927,12 @@ required.</p>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Responsibilities:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type
|
||||
- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers
|
||||
<span class="normal">2</span>
|
||||
<span class="normal">3</span>
|
||||
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A `Content` instance represents a raw content payload along with
|
||||
minimal contextual metadata describing its origin and type.
|
||||
- This class is the primary exchange format between scrapers,
|
||||
parsers, and downstream consumers.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
|
||||
|
||||
@@ -2026,8 +2073,12 @@ required.</p>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Guarantees:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the content source
|
||||
- It is primarily used for routing content to the appropriate parser or downstream consumer
|
||||
<span class="normal">2</span>
|
||||
<span class="normal">3</span>
|
||||
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This enum represents the declared or inferred media type of the
|
||||
content source.
|
||||
- It is primarily used for routing content to the appropriate
|
||||
parser or downstream consumer.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
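<p>Routing by media type, as the notes above describe, might look like the following minimal sketch. <code>SketchContentType</code>, its two members, and the handler functions are assumptions made for illustration; the members of the real <code>ContentType</code> enum are not shown in this page and may differ.</p>

```python
from enum import Enum
from typing import Callable


class SketchContentType(str, Enum):
    """Hypothetical media-type enum; the real members may differ."""
    HTML = "text/html"
    PDF = "application/pdf"


def describe_html(raw: bytes) -> str:
    return f"HTML document, {len(raw)} bytes"


def describe_pdf(raw: bytes) -> str:
    return f"PDF document, {len(raw)} bytes"


# Routing table: one handler per declared media type.
HANDLERS: dict[SketchContentType, Callable[[bytes], str]] = {
    SketchContentType.HTML: describe_html,
    SketchContentType.PDF: describe_pdf,
}


def route(content_type: str, raw: bytes) -> str:
    """Look up the handler for a media type string and apply it."""
    # Drop parameters such as "; charset=utf-8" before the enum lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return HANDLERS[SketchContentType(media_type)](raw)


summary = route("text/html; charset=utf-8", b"<html></html>")
```

<p>An unknown media type raises <code>ValueError</code> at the enum lookup, which keeps unsupported content from reaching a parser silently.</p>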
|
||||
|
||||
@@ -2169,7 +2220,9 @@ required.</p>
|
||||
<details class="notes" open>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Guarantees:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns their raw binary contents
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This client reads PDF files directly from the disk and returns
|
||||
their raw binary contents.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
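<p>A filesystem client of this kind can be approximated in a few lines. The class name <code>SketchFileSystemClient</code> and its <code>fetch</code> method are hypothetical; the sketch only assumes the behavior stated above, namely reading a file from disk and returning its bytes unchanged.</p>

```python
import tempfile
from pathlib import Path


class SketchFileSystemClient:
    """Hypothetical stand-in: reads a file from disk and returns raw bytes."""

    def fetch(self, path: Path) -> bytes:
        if not path.is_file():
            raise FileNotFoundError(f"no such file: {path}")
        return path.read_bytes()


# Write a tiny placeholder file so the sketch runs without a real PDF.
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = Path(tmp) / "document.pdf"
    pdf_path.write_bytes(b"%PDF-1.4\n%%EOF\n")
    raw = SketchFileSystemClient().fetch(pdf_path)

# PDF files start with the "%PDF-" marker; nothing else is interpreted here.
looks_like_pdf = raw.startswith(b"%PDF-")
```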
|
||||
|
||||
@@ -2311,7 +2364,7 @@ required.</p>
|
||||
|
||||
<div class="doc doc-contents ">
|
||||
<p class="doc doc-class-bases">
|
||||
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
|
||||
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
|
||||
|
||||
|
||||
<p>Base HTML parser.</p>
|
||||
@@ -2321,14 +2374,24 @@ required.</p>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Responsibilities:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers
|
||||
- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type
|
||||
<span class="normal">2</span>
|
||||
<span class="normal">3</span>
|
||||
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior,
|
||||
including DOM parsing via BeautifulSoup and reusable extraction helpers.
|
||||
- Provides reusable helpers for HTML extraction. Concrete parsers must
|
||||
explicitly define the return type.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
<p><strong>Guarantees:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span>
|
||||
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Accepts only HTML content.
|
||||
- Owns a parsed BeautifulSoup DOM tree.
|
||||
- Provides pure helper utilities for common HTML structures.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
<p><strong>Constraints:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement
|
||||
the `parse()` method.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
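<p>The constraints above (reject non-HTML input, fix the output type <code>T</code>, implement <code>parse()</code>) can be illustrated with a generic abstract base class. This is a sketch under stated assumptions, not the library's implementation: <code>SketchHTMLParser</code> and <code>LengthParser</code> are invented names, and the sketch stores decoded text instead of a BeautifulSoup tree so that it needs only the standard library.</p>

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")


class SketchHTMLParser(ABC, Generic[T]):
    """Hypothetical base: rejects non-HTML input, leaves parse() to subclasses."""

    def __init__(self, raw: bytes, content_type: str) -> None:
        if not content_type.startswith("text/html"):
            raise ValueError(f"expected HTML content, got {content_type!r}")
        # The real class is described as owning a BeautifulSoup tree;
        # this sketch keeps decoded text to stay standard-library only.
        self._text = raw.decode("utf-8", errors="replace")

    @abstractmethod
    def parse(self) -> T:
        """Return the structured result for this parser."""


class LengthParser(SketchHTMLParser[int]):
    """Concrete subclass: fixes T to int and implements parse()."""

    def parse(self) -> int:
        return len(self._text)


length = LengthParser(b"<p>hi</p>", "text/html").parse()

try:
    LengthParser(b"%PDF-1.4", "application/pdf")
    rejected = False
except ValueError:
    rejected = True
```

<p>Declaring <code>parse()</code> abstract means the base class itself cannot be instantiated, so every usable parser has to commit to a concrete return type.</p>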
|
||||
<p>Initialize the HTML parser.</p>
|
||||
@@ -2348,7 +2411,7 @@ required.</p>
|
||||
<tr class="doc-section-item">
|
||||
<td><code>content</code></td>
|
||||
<td>
|
||||
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
|
||||
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
|
||||
</td>
|
||||
<td>
|
||||
<div class="doc-md-description">
|
||||
@@ -2482,7 +2545,9 @@ required.</p>
|
||||
<details class="notes" open>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Responsibilities:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a
|
||||
deterministic, structured output.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
|
||||
</div>
|
||||
@@ -2698,8 +2763,10 @@ Dictionary containing extracted metadata.</p>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Responsibilities:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document
|
||||
- This includes: Document title, `<meta>` tag name/property → content mappings
|
||||
<span class="normal">2</span>
|
||||
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document.
|
||||
- This includes the document title and `&lt;meta&gt;` tag name/property to
  content mappings.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
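<p>To make the metadata extraction concrete, here is a minimal sketch that collects the document title and the <code>&lt;meta&gt;</code> name/property to content mappings. It uses the standard library's <code>html.parser</code> as a stand-in for BeautifulSoup, and both the <code>MetaCollector</code> class and the shape of the resulting dictionary are assumptions, not the documented return value.</p>

```python
from html.parser import HTMLParser as StdlibHTMLParser


class MetaCollector(StdlibHTMLParser):
    """Collects the <title> text and <meta> name/property -> content pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Prefer `name`, fall back to `property` (e.g. Open Graph tags).
            key = attributes.get("name") or attributes.get("property")
            if key and attributes.get("content") is not None:
                self.meta[key] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


collector = MetaCollector()
collector.feed(
    "<html><head><title>Example</title>"
    '<meta name="description" content="A demo page">'
    '<meta property="og:type" content="article">'
    "</head></html>"
)
metadata = {"title": collector.title.strip(), "meta": collector.meta}
```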
|
||||
</div>
|
||||
@@ -2846,21 +2913,29 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
|
||||
<div class="doc doc-contents ">
|
||||
<p class="doc doc-class-bases">
|
||||
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
|
||||
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
|
||||
|
||||
|
||||
<p>Base HTML scraper using httpx.</p>
|
||||
<p>Base HTML scraper using <code>httpx</code>.</p>
|
||||
|
||||
|
||||
<details class="notes" open>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Responsibilities:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object
|
||||
- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata
|
||||
<span class="normal">2</span>
|
||||
<span class="normal">3</span>
|
||||
<span class="normal">4</span>
|
||||
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><code>- This scraper retrieves HTML documents over HTTP(S) and returns
|
||||
them as raw content wrapped in a `Content` object.
|
||||
- Fetches raw bytes and metadata only.
|
||||
- The scraper uses `httpx.Client` for HTTP requests, enforces an
|
||||
HTML content type, and preserves HTTP response metadata.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
<p><strong>Constraints:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not parse HTML, perform retries or backoff,
  or handle non-HTML responses.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
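<p>The content-type enforcement and metadata preservation described above can be sketched without any network access. <code>FakeResponse</code> is a minimal stand-in and not the <code>httpx</code> response type, and <code>to_html_payload</code> together with the keys of the returned dictionary are hypothetical names used only for this illustration.</p>

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Minimal stand-in for an HTTP response; not the httpx type."""
    status_code: int
    headers: dict[str, str]
    content: bytes


def to_html_payload(url: str, response: FakeResponse) -> dict:
    """Reject non-HTML responses and keep raw bytes plus response metadata."""
    content_type = response.headers.get("content-type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/html":
        raise ValueError(f"expected text/html from {url}, got {content_type!r}")
    return {
        "raw": response.content,  # bytes are passed through untouched
        "source": url,
        "metadata": {
            "status_code": response.status_code,
            "headers": dict(response.headers),
        },
    }


ok = FakeResponse(200, {"content-type": "text/html; charset=utf-8"}, b"<html></html>")
payload = to_html_payload("https://example.com", ok)

try:
    to_html_payload(
        "https://example.com/data",
        FakeResponse(200, {"content-type": "application/json"}, b"{}"),
    )
    accepted_json = True
except ValueError:
    accepted_json = False
```

<p>The function neither retries nor parses the body, which mirrors the constraints listed for the scraper: anything beyond acquisition is left to the caller or to a parser.</p>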
|
||||
<p>Initialize the HTML scraper.</p>
|
||||
@@ -3019,7 +3094,7 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
<tbody>
|
||||
<tr class="doc-section-item">
|
||||
<td><code>Content</code></td> <td>
|
||||
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
|
||||
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
|
||||
</td>
|
||||
<td>
|
||||
<div class="doc-md-description">
|
||||
@@ -3160,7 +3235,7 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
|
||||
<div class="doc doc-contents ">
|
||||
<p class="doc doc-class-bases">
|
||||
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
|
||||
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.pdf.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.pdf.parser.T">T</span>]</code></p>
|
||||
|
||||
|
||||
<p>Base PDF parser.</p>
|
||||
@@ -3169,10 +3244,14 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
<details class="notes" open>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Responsibilities:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class enforces PDF content-type compatibility and provides
|
||||
the extension point for implementing concrete PDF parsing strategies.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
<p><strong>Constraints:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must: Define the output type `T`, implement the `parse()` method
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete implementations must define the output type `T` and
|
||||
implement the `parse()` method.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
|
||||
<p>Initialize the parser with content to be parsed.</p>
|
||||
@@ -3192,7 +3271,7 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
<tr class="doc-section-item">
|
||||
<td><code>content</code></td>
|
||||
<td>
|
||||
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
|
||||
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
|
||||
</td>
|
||||
<td>
|
||||
<div class="doc-md-description">
|
||||
@@ -3335,7 +3414,9 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
<details class="notes" open>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Responsibilities:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the PDF binary payload and
|
||||
return a deterministic, structured output.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
|
||||
</div>
|
||||
@@ -3406,7 +3487,7 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
|
||||
<div class="doc doc-contents ">
|
||||
<p class="doc doc-class-bases">
|
||||
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="omniread/core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
|
||||
Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.scraper.BaseScraper" href="core/scraper/#omniread.core.scraper.BaseScraper">BaseScraper</a></code></p>
|
||||
|
||||
|
||||
<p>Scraper for PDF sources.</p>
|
||||
@@ -3416,11 +3497,15 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
<summary>Notes</summary>
|
||||
<p><strong>Responsibilities:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output into Content
|
||||
- Preserves caller-provided metadata
|
||||
<span class="normal">2</span>
|
||||
<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Delegates byte retrieval to a PDF client and normalizes output
|
||||
into `Content`.
|
||||
- Preserves caller-provided metadata.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
<p><strong>Constraints:</strong></p>
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend
|
||||
<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
|
||||
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- The scraper does not perform parsing or interpretation.
|
||||
- Does not assume a specific storage backend.
|
||||
</code></pre></div></td></tr></table></div>
|
||||
</details>
|
||||
<p>Initialize the PDF scraper.</p>
|
||||
@@ -3537,7 +3622,7 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
<tbody>
|
||||
<tr class="doc-section-item">
|
||||
<td><code>Content</code></td> <td>
|
||||
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="omniread/core/content/#omniread.core.content.Content">Content</a></code>
|
||||
<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="core/content/#omniread.core.content.Content">Content</a></code>
|
||||
</td>
|
||||
<td>
|
||||
<div class="doc-md-description">
|
||||
@@ -3590,9 +3675,7 @@ A list of rows, where each row is a list of cell text values.</p>
|
||||
|
||||
</div>
|
||||
|
||||
</div><ul>
|
||||
<li><a href="omniread/">Omniread</a></li>
|
||||
</ul>
|
||||
</div>