@@ -1034,15 +1034,16 @@
 
 <div class="doc doc-contents first">
 
-<p>HTML parser base implementations for OmniRead.</p>
-<hr />
-<h4 id="omniread.html.parser--summary">Summary</h4>
+<h3 id="omniread.html.parser--summary">Summary</h3>
+<p>HTML parser base implementations for OmniRead.</p>
 <p>This module provides reusable HTML parsing utilities built on top of
 the abstract parser contracts defined in <code>omniread.core.parser</code>.</p>
-<p>It supplies:
-- Content-type enforcement for HTML inputs
-- BeautifulSoup initialization and lifecycle management
-- Common helper methods for extracting structured data from HTML elements</p>
+<p>It supplies:</p>
+<ul>
+<li>Content-type enforcement for HTML inputs</li>
+<li>BeautifulSoup initialization and lifecycle management</li>
+<li>Common helper methods for extracting structured data from HTML elements</li>
+</ul>
 <p>Concrete parsers must subclass <code>HTMLParser</code> and implement the <code>parse()</code> method
 to return a structured representation appropriate for their use case.</p>
 
@@ -1071,7 +1072,7 @@ to return a structured representation appropriate for their use case.</p>
 
 <div class="doc doc-contents ">
 <p class="doc doc-class-bases">
-Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../../omniread/core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
+Bases: <code><a class="autorefs autorefs-internal" title="omniread.core.parser.BaseParser" href="../../core/parser/#omniread.core.parser.BaseParser">BaseParser</a>[<span title="omniread.html.parser.T">T</span>]</code>, <code><span title="typing.Generic">Generic</span>[<span title="omniread.html.parser.T">T</span>]</code></p>
 
 
 <p>Base HTML parser.</p>
@@ -1081,14 +1082,24 @@ to return a structured representation appropriate for their use case.</p>
 <summary>Notes</summary>
 <p><strong>Responsibilities:</strong></p>
 <div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
-<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers
-- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type
+<span class="normal">2</span>
+<span class="normal">3</span>
+<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- This class extends the core `BaseParser` with HTML-specific behavior,
+including DOM parsing via BeautifulSoup and reusable extraction helpers.
+- Provides reusable helpers for HTML extraction. Concrete parsers must
+explicitly define the return type.
 </code></pre></div></td></tr></table></div>
 <p><strong>Guarantees:</strong></p>
-<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures
+<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
+<span class="normal">2</span>
+<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Accepts only HTML content.
+- Owns a parsed BeautifulSoup DOM tree.
+- Provides pure helper utilities for common HTML structures.
 </code></pre></div></td></tr></table></div>
 <p><strong>Constraints:</strong></p>
-<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement the `parse()` method
+<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
+<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Concrete subclasses must define the output type `T` and implement
+the `parse()` method.
 </code></pre></div></td></tr></table></div>
 </details>
 <p>Initialize the HTML parser.</p>
@@ -1108,7 +1119,7 @@ to return a structured representation appropriate for their use case.</p>
 <tr class="doc-section-item">
 <td><code>content</code></td>
 <td>
-<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../omniread/core/content/#omniread.core.content.Content">Content</a></code>
+<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../core/content/#omniread.core.content.Content">Content</a></code>
 </td>
 <td>
 <div class="doc-md-description">
@@ -1242,7 +1253,9 @@ to return a structured representation appropriate for their use case.</p>
 <details class="notes" open>
 <summary>Notes</summary>
 <p><strong>Responsibilities:</strong></p>
-<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a deterministic, structured output
+<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
+<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must fully interpret the HTML DOM and return a
+deterministic, structured output.
 </code></pre></div></td></tr></table></div>
 </details>
 </div>
@@ -1458,8 +1471,10 @@ Dictionary containing extracted metadata.</p>
 <summary>Notes</summary>
 <p><strong>Responsibilities:</strong></p>
 <div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
-<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document
-- This includes: Document title, `<meta>` tag name/property → content mappings
+<span class="normal">2</span>
+<span class="normal">3</span></pre></div></td><td class="code"><div><pre><span></span><code>- Extract high-level metadata from the HTML document.
+- This includes: Document title, `<meta>` tag name/property to
+content mappings.
 </code></pre></div></td></tr></table></div>
 </details>
 </div>
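
Review note: the metadata responsibilities described in the last hunk (document title plus `<meta>` name/property to content mappings) can be illustrated with a small sketch. This is not OmniRead's implementation. The documented class works on a BeautifulSoup DOM, while this example uses only the standard-library `html.parser` so that it runs without extra dependencies. The names `MetadataCollector` and `parse_meta`, and the `{"title": ..., "meta": ...}` return shape, are assumptions made for illustration.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class MetadataCollector(StdlibHTMLParser):
    """Hypothetical helper: gathers the title and <meta> mappings."""

    def __init__(self) -> None:
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Key on `name` first, then fall back to `property`.
            key = attr_map.get("name") or attr_map.get("property")
            content = attr_map.get("content")
            if key and content is not None:
                self.meta[key] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>.
        if self._in_title:
            self.title = (self.title or "") + data


def parse_meta(html: str) -> dict:
    """Return the title and meta mappings (assumed output shape)."""
    collector = MetadataCollector()
    collector.feed(html)
    collector.close()
    title = collector.title.strip() if collector.title else None
    return {"title": title, "meta": collector.meta}
```

Calling `parse_meta` on a document string returns a plain dictionary, in line with the "deterministic, structured output" requirement that the docs place on `parse()` implementations.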