@@ -896,19 +896,22 @@

 <div class="doc doc-contents first">

-<p>Abstract scraping contracts for OmniRead.</p>
-<hr />
-<h4 id="omniread.core.scraper--summary">Summary</h4>
+<h3 id="omniread.core.scraper--summary">Summary</h3>
+<p>Abstract scraping contracts for OmniRead.</p>
 <p>This module defines the <strong>format-agnostic scraper interface</strong> responsible for
 acquiring raw content from external sources.</p>
-<p>Scrapers are responsible for:
-- Locating and retrieving raw content bytes
-- Attaching minimal contextual metadata
-- Returning normalized <code>Content</code> objects</p>
-<p>Scrapers are explicitly NOT responsible for:
-- Parsing or interpreting content
-- Inferring structure or semantics
-- Performing content-type specific processing</p>
+<p>Scrapers are responsible for:</p>
+<ul>
+<li>Locating and retrieving raw content bytes</li>
+<li>Attaching minimal contextual metadata</li>
+<li>Returning normalized <code>Content</code> objects</li>
+</ul>
+<p>Scrapers are explicitly NOT responsible for:</p>
+<ul>
+<li>Parsing or interpreting content</li>
+<li>Inferring structure or semantics</li>
+<li>Performing content-type specific processing</li>
+</ul>
 <p>All interpretation must be delegated to parsers.</p>

@@ -947,13 +950,21 @@ acquiring raw content from external sources.</p>
 <div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
 <span class="normal">2</span>
 <span class="normal">3</span>
-<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code>- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it
-- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object
-- Scrapers define how content is obtained, not what the content means
-- Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior
+<span class="normal">4</span>
+<span class="normal">5</span>
+<span class="normal">6</span>
+<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><code>- A scraper is responsible ONLY for fetching raw content (bytes)
+from a source. It must not interpret or parse it.
+- A scraper is a stateless acquisition component that retrieves raw
+content from a source and returns it as a `Content` object.
+- Scrapers define how content is obtained, not what the content means.
+- Implementations may vary in transport mechanism, authentication
+strategy, retry and backoff behavior.
 </code></pre></div></td></tr></table></div>
 <p><strong>Constraints:</strong></p>
-<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser
+<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
+<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must not parse content, modify content semantics,
+or couple scraping logic to a specific parser.
 </code></pre></div></td></tr></table></div>
 </details>

@@ -1007,7 +1018,7 @@ acquiring raw content from external sources.</p>
 </td>
 <td>
 <div class="doc-md-description">
-<p>Location identifier (URL, file path, S3 URI, etc.)</p>
+<p>Location identifier (URL, file path, S3 URI, etc.).</p>
 </div>
 </td>
 <td>
@@ -1021,7 +1032,7 @@ acquiring raw content from external sources.</p>
 </td>
 <td>
 <div class="doc-md-description">
-<p>Optional hints for the scraper (headers, auth, etc.)</p>
+<p>Optional hints for the scraper (headers, auth, etc.).</p>
 </div>
 </td>
 <td>
@@ -1043,7 +1054,7 @@ acquiring raw content from external sources.</p>
 <tbody>
 <tr class="doc-section-item">
 <td><code>Content</code></td> <td>
-<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../../omniread/core/content/#omniread.core.content.Content">Content</a></code>
+<code><a class="autorefs autorefs-internal" title="omniread.core.content.Content" href="../content/#omniread.core.content.Content">Content</a></code>
 </td>
 <td>
 <div class="doc-md-description">
@@ -1081,7 +1092,9 @@ acquiring raw content from external sources.</p>
 <details class="notes" open>
 <summary>Notes</summary>
 <p><strong>Responsibilities:</strong></p>
-<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object
+<div class="language-text highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span>
+<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><code>- Implementations must retrieve the content referenced by `source`
+and return it as raw bytes wrapped in a `Content` object.
 </code></pre></div></td></tr></table></div>
 </details>
 </div>
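
For readers without the source open, the contract documented in these hunks could be sketched roughly as follows. This is a minimal illustration, not OmniRead's actual code: the names `BaseScraper`, `fetch`, and `LocalFileScraper`, and the fields on the stand-in `Content` class, are assumptions. Only the `source` and `metadata` parameters and the `Content` return type come from the documentation above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Content:
    """Hypothetical stand-in for omniread.core.content.Content."""

    raw: bytes                                   # uninterpreted payload
    source: str                                  # where the bytes came from
    metadata: Mapping[str, Any] = field(default_factory=dict)


class BaseScraper(ABC):
    """Assumed shape of the abstract scraper: acquire bytes, never parse them."""

    @abstractmethod
    def fetch(
        self,
        source: str,
        *,
        metadata: Optional[Mapping[str, Any]] = None,
    ) -> Content:
        """Retrieve the content referenced by `source` as raw bytes."""


class LocalFileScraper(BaseScraper):
    """Illustrative implementation that treats `source` as a file path."""

    def fetch(
        self,
        source: str,
        *,
        metadata: Optional[Mapping[str, Any]] = None,
    ) -> Content:
        # Read bytes only; decoding and parsing are left to a parser.
        raw = Path(source).read_bytes()
        return Content(raw=raw, source=source, metadata=dict(metadata or {}))
```

Under these assumptions, a caller would write `LocalFileScraper().fetch("page.html")` and hand the resulting `Content` to whichever parser it chooses, so the scraper stays decoupled from any specific parser.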