Scraper

omniread.html.scraper

HTML scraping implementation for OmniRead.


Summary

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for:

- Fetching raw HTML bytes over HTTP(S)
- Validating the response content type
- Attaching HTTP metadata to the returned content

This scraper is not responsible for:

- Parsing or interpreting HTML
- Retrying failed requests
- Managing crawl policies or rate limiting

Classes

HTMLScraper

HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- Retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object.
- Fetches raw bytes and metadata only: it uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata.

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses.
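
Because retries and backoff are left to the caller, one option is to wrap the `fetch` call. The following is a minimal sketch, not part of OmniRead: `fetch_with_retries` and its parameters are hypothetical names, and it works with any callable that takes a URL. Only the standard library is used.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    source: str,
    *,
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(source), retrying with exponential backoff on selected errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(source)
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

With a scraper instance this could be called as `fetch_with_retries(scraper.fetch, url, retry_on=(httpx.HTTPError,))`, assuming the `HTTPError` raised by `fetch` is `httpx.HTTPError`. A `ValueError` for a non-HTML response is deliberately not retried here, since repeating the request is unlikely to change the content type.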

Initialize the HTML scraper.

Parameters:

- client (Client | None, default None): Optional pre-configured httpx.Client. If omitted, a client is created internally.
- timeout (float, default 15.0): Request timeout in seconds.
- headers (Optional[Mapping[str, str]], default None): Optional default HTTP headers.
- follow_redirects (bool, default True): Whether to follow HTTP redirects.

Functions
fetch
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch an HTML document from the given source.

Parameters:

- source (str, required): URL of the HTML document.
- metadata (Optional[Mapping[str, Any]], default None): Optional metadata to be merged into the returned content.

Returns:

- Content: A Content instance containing the raw HTML bytes, the source URL, the HTML content type, and HTTP response metadata.

Raises:

- HTTPError: If the HTTP request fails.
- ValueError: If the response is not valid HTML.
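
The constructor and `fetch` signatures above suggest usage along the following lines. This is a sketch under assumptions: the import path is taken from the module name `omniread.html.scraper`, the `HTTPError` listed above is assumed to be `httpx.HTTPError`, and the helper name `fetch_html`, the header value, and the metadata key are illustrative only.

```python
from typing import Any, Optional


def fetch_html(url: str) -> Optional[Any]:
    """Fetch one page; return None if the request fails or the response is not HTML."""
    # Imported inside the function so the sketch loads without the packages installed.
    import httpx
    from omniread.html.scraper import HTMLScraper  # assumed import path

    scraper = HTMLScraper(
        timeout=10.0,
        headers={"User-Agent": "omniread-docs-example/0.1"},  # illustrative value
        follow_redirects=True,
    )
    try:
        # The metadata mapping is merged into the returned Content.
        return scraper.fetch(url, metadata={"requested_by": "docs-example"})
    except httpx.HTTPError as exc:  # assumption: HTTPError above is httpx.HTTPError
        print(f"request failed: {exc}")
    except ValueError as exc:
        print(f"not an HTML response: {exc}")
    return None
```

The returned object is left opaque here because the attribute names of `Content` are not documented in this section.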

validate_content_type
validate_content_type(response: httpx.Response) -> None

Validate that the HTTP response contains HTML content.

Parameters:

- response (Response, required): HTTP response returned by httpx.

Raises:

- ValueError: If the Content-Type header is missing or does not indicate HTML content.
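
The check described above can be approximated as follows. This is a standalone sketch, not the library's implementation: `check_html_content_type` is a hypothetical helper that operates on a plain header mapping rather than an `httpx.Response`, and the set of accepted media types is an assumption.

```python
from typing import Mapping

# Assumed set of media types treated as HTML; the real scraper may differ.
HTML_MEDIA_TYPES = frozenset({"text/html", "application/xhtml+xml"})


def check_html_content_type(headers: Mapping[str, str]) -> None:
    """Raise ValueError unless the headers declare an HTML media type."""
    raw = None
    # Header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "content-type":
            raw = value
            break
    if raw is None or not raw.strip():
        raise ValueError("Missing Content-Type header")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = raw.split(";", 1)[0].strip().lower()
    if media_type not in HTML_MEDIA_TYPES:
        raise ValueError(f"Expected HTML content, got {media_type!r}")
```

For example, `{"Content-Type": "text/html; charset=utf-8"}` passes silently, while an empty mapping or `{"Content-Type": "application/json"}` raises `ValueError`.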