Scraper

omniread.html.scraper

HTML scraping implementation for OmniRead.

Summary

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for:

- Fetching raw HTML bytes over HTTP(S)
- Validating the response content type
- Attaching HTTP metadata to the returned content

This scraper is not responsible for:

- Parsing or interpreting HTML
- Retrying failed requests
- Managing crawl policies or rate limiting

Classes

HTMLScraper

HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- Retrieves HTML documents over HTTP(S) using `httpx.Client` and returns them as raw bytes wrapped in a `Content` object
- Enforces an HTML content type and preserves HTTP response metadata; it fetches raw bytes and metadata only

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses

Initialize the HTML scraper.

Parameters:

Name	Type	Description	Default
`client`	`Optional[httpx.Client]`	Optional pre-configured `httpx.Client`. If omitted, a client is created internally.	`None`
`timeout`	`float`	Request timeout in seconds.	`15.0`
`headers`	`Optional[Mapping[str, str]]`	Optional default HTTP headers.	`None`
`follow_redirects`	`bool`	Whether to follow HTTP redirects.	`True`

Functions

fetch

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch an HTML document from the given source.

Parameters:

Name	Type	Description	Default
`source`	`str`	URL of the HTML document.	required
`metadata`	`Optional[Mapping[str, Any]]`	Optional metadata to be merged into the returned content.	`None`

Returns:

Name	Type	Description
`Content`	`Content`	A `Content` instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.
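
The `metadata` parameter is described as being merged into the returned content alongside the HTTP response metadata. The following is a minimal sketch of what such a merge could look like, not the library's actual code. The function name `merge_metadata`, the example keys, and the rule that caller-supplied keys win on conflict are all assumptions made for illustration.

```python
from typing import Any, Dict, Mapping, Optional


def merge_metadata(
    http_metadata: Mapping[str, Any],
    extra: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a new dict combining HTTP metadata with caller-supplied metadata."""
    # Copy first so neither input mapping is mutated.
    merged = dict(http_metadata)
    if extra:
        # Assumption: caller-supplied keys take precedence on conflict.
        merged.update(extra)
    return merged


# Hypothetical keys, chosen only to illustrate the merge.
http_meta = {"status_code": 200, "final_url": "https://example.com/"}
combined = merge_metadata(http_meta, {"crawl_id": "job-1"})
```

Returning a fresh dictionary keeps the caller's mapping untouched, which matters when the same `metadata` object is reused across several `fetch` calls.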

Raises:

Type	Description
`HTTPError`	If the HTTP request fails.
`ValueError`	If the response does not indicate HTML content (for example, a missing or non-HTML `Content-Type` header).
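
Because the scraper does not retry failed requests, callers that need retries have to add them around `fetch`. The sketch below shows one possible wrapper with exponential backoff. The name `fetch_with_retries` and its parameters are hypothetical, and the default `retry_on=(OSError,)` is a placeholder that keeps the example dependency-free. With the real scraper one would presumably pass the HTTP error type listed above (assumed here to be `httpx.HTTPError`).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    source: str,
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(source), retrying on the given exception types with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(source)
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Exceptions outside `retry_on` propagate immediately. That is deliberate: a `ValueError` for a non-HTML response is unlikely to succeed on a second attempt, so retrying it would only waste requests.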

validate_content_type

validate_content_type(response: httpx.Response) -> None

Validate that the HTTP response contains HTML content.

Parameters:

Name	Type	Description	Default
`response`	`httpx.Response`	HTTP response returned by `httpx`.	required

Raises:

Type	Description
`ValueError`	If the `Content-Type` header is missing or does not indicate HTML content.
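
The check described above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: it takes a plain headers mapping instead of an `httpx.Response` so that it runs without third-party packages, the names `media_type` and `check_html_content_type` are hypothetical, and the set of accepted media types is a guess.

```python
from typing import Mapping, Optional

# Assumption: these media types count as HTML; the real scraper may differ.
HTML_MEDIA_TYPES = frozenset({"text/html", "application/xhtml+xml"})


def media_type(content_type: Optional[str]) -> str:
    """Return the lowercase media type without parameters such as charset."""
    if not content_type:
        return ""
    return content_type.split(";", 1)[0].strip().lower()


def check_html_content_type(headers: Mapping[str, str]) -> None:
    """Raise ValueError unless the Content-Type header indicates HTML."""
    # Header names are case-insensitive, so normalize keys before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("content-type")
    if media_type(raw) not in HTML_MEDIA_TYPES:
        raise ValueError(f"Expected an HTML response, got Content-Type: {raw!r}")


# Passes silently: parameters such as charset are ignored.
check_html_content_type({"Content-Type": "text/html; charset=utf-8"})
```

Comparing only the media type, rather than the full header value, avoids rejecting valid responses that carry a `charset` or other parameter.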