Scraper
omniread.html.scraper
HTML scraping implementation for OmniRead.
Summary
This module provides an HTTP-based scraper for retrieving HTML documents.
It implements the core BaseScraper contract using httpx as the transport
layer.
This scraper is responsible for:

- Fetching raw HTML bytes over HTTP(S)
- Validating response content type
- Attaching HTTP metadata to the returned content
This scraper is not responsible for:

- Parsing or interpreting HTML
- Retrying failed requests
- Managing crawl policies or rate limiting
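Because retrying is left to the caller, a thin wrapper around any fetch callable is one way to add it. The following is a minimal sketch, not part of OmniRead: `fetch_with_retries`, its parameters, and the exponential backoff policy are all hypothetical names and choices made for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    source: str,
    attempts: int = 3,
    backoff: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(source), retrying on the given exception types.

    Hypothetical helper: the delay doubles after each failed attempt,
    and the last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch(source)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(backoff * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With a scraper instance, the bound `fetch` method could presumably be passed as the callable. Narrowing `retry_on` to transport errors keeps content-type failures, which a retry is unlikely to fix, from being retried.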
Classes
HTMLScraper
Bases: BaseScraper
Base HTML scraper using httpx.
Notes
Responsibilities:
Constraints:
Initialize the HTML scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Client \| None | Optional pre-configured client. | None |
| timeout | float | Request timeout in seconds. | 15.0 |
| headers | Optional[Mapping[str, str]] | Optional default HTTP headers. | None |
| follow_redirects | bool | Whether to follow HTTP redirects. | True |
Functions
fetch
Fetch an HTML document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | URL of the HTML document. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to be merged into the returned content. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| Content | Content | A Content object. |
Raises:
| Type | Description |
|---|---|
| HTTPError | If the HTTP request fails. |
| ValueError | If the response is not valid HTML. |
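The `metadata` parameter of `fetch` is described as being merged into the returned content alongside the HTTP metadata. A minimal sketch of such a merge is shown here. The function name `merge_metadata`, the example keys, and the rule that caller-supplied values win on conflict are assumptions; the source does not state the actual precedence.

```python
from typing import Any, Dict, Mapping, Optional


def merge_metadata(
    http_metadata: Mapping[str, Any],
    extra: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a new dict combining HTTP metadata with caller metadata.

    Assumption: caller-supplied keys override HTTP-derived keys.
    Neither input mapping is modified.
    """
    merged = dict(http_metadata)
    if extra:
        merged.update(extra)
    return merged
```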
validate_content_type
Validate that the HTTP response contains HTML content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| response | Response | HTTP response returned by httpx. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the response does not contain HTML content. |
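To illustrate the kind of check `validate_content_type` describes, here is a minimal standalone sketch. It is not OmniRead's implementation: `check_html_content_type`, the accepted media types, and the use of a plain mapping with a lowercase `content-type` key in place of a real response object are all assumptions.

```python
from typing import Mapping

# Assumption: which media types count as HTML is not stated in the source.
HTML_MEDIA_TYPES = frozenset({"text/html", "application/xhtml+xml"})


def media_type(content_type: str) -> str:
    """Strip parameters such as charset and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def check_html_content_type(headers: Mapping[str, str]) -> None:
    """Raise ValueError unless the Content-Type header indicates HTML."""
    raw = headers.get("content-type", "")
    if media_type(raw) not in HTML_MEDIA_TYPES:
        raise ValueError(f"Expected HTML content, got Content-Type: {raw!r}")
```

Comparing only the media type, rather than the full header value, lets values such as `text/html; charset=utf-8` pass while a missing header or `application/json` is rejected.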