Scraper

omniread.html.scraper

Summary

HTML scraping implementation for OmniRead.

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for:

  • Fetching raw HTML bytes over HTTP(S)
  • Validating response content type
  • Attaching HTTP metadata to the returned content

This scraper is not responsible for:

  • Parsing or interpreting HTML
  • Retrying failed requests
  • Managing crawl policies or rate limiting
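Because retries are left to the caller, one option is a small wrapper around any fetch callable. The following is a minimal sketch and not part of OmniRead: the name `fetch_with_retries`, its parameters, and the exponential backoff policy are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    source: str,
    *,
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Call fetch(source), retrying only on the exception types in retry_on.

    Hypothetical helper: the delay doubles after each failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(source)
        except retry_on:
            # Re-raise the original error once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

With this scraper, `fetch` would be the scraper's `fetch` method, and `retry_on` would presumably be `(httpx.HTTPError,)` so that the `ValueError` raised for non-HTML responses is not retried.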

Classes

HTMLScraper

HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- Retrieves HTML documents over HTTP(S) and returns them as raw
  content wrapped in a `Content` object.
- Fetches raw bytes and metadata only.
- Uses `httpx.Client` for HTTP requests, enforces an HTML content
  type, and preserves HTTP response metadata.

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or
  handle non-HTML responses.

Initialize the HTML scraper.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| client | Optional[httpx.Client] | Optional pre-configured httpx.Client. If omitted, a client is created internally. | None |
| timeout | float | Request timeout in seconds. | 15.0 |
| headers | Optional[Mapping[str, str]] | Optional default HTTP headers. | None |
| follow_redirects | bool | Whether to follow HTTP redirects. | True |
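
As an illustration of the `client` parameter, the sketch below passes in a caller-owned `httpx.Client`. The helper name `fetch_pages`, the User-Agent string, and the timeout value are hypothetical, and the import path is inferred from the module name `omniread.html.scraper`. How `timeout`, `headers`, and `follow_redirects` interact with a supplied client is not stated here, so the sketch sets those options on the client itself.

```python
from typing import Any, Iterable, List


def fetch_pages(urls: Iterable[str]) -> List[Any]:
    """Fetch several pages through one shared, caller-owned httpx.Client."""
    # Imported inside the function so this sketch can be loaded even where
    # httpx or OmniRead is not installed.
    import httpx
    from omniread.html.scraper import HTMLScraper

    # The caller owns the client, so the caller closes it (here via `with`).
    with httpx.Client(
        timeout=10.0,
        follow_redirects=True,
        headers={"User-Agent": "example-bot/0.1"},  # hypothetical value
    ) as client:
        scraper = HTMLScraper(client=client)
        return [scraper.fetch(url) for url in urls]
```

Sharing one client lets httpx reuse connections across requests, which is the usual reason to supply a client instead of letting the scraper create its own.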
Functions

fetch

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch an HTML document from the given source.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| source | str | URL of the HTML document. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to be merged into the returned content. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Content | Content | A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata. |

Raises:

| Type | Description |
| --- | --- |
| HTTPError | If the HTTP request fails. |
| ValueError | If the response is not valid HTML. |
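
A caller will often want to treat the two documented failure modes differently. The sketch below is one possible pattern, not part of OmniRead: the helper name `fetch_html_or_none` is hypothetical, and `scraper` is assumed to be any object exposing the `fetch(source, *, metadata=None)` method described above. It skips non-HTML responses by catching `ValueError` and lets HTTP errors propagate.

```python
from typing import Any, Mapping, Optional


def fetch_html_or_none(
    scraper: Any,
    url: str,
    *,
    metadata: Optional[Mapping[str, Any]] = None,
) -> Optional[Any]:
    """Return the fetched content, or None if the response is not HTML.

    HTTP failures are deliberately not caught here, so the caller can
    decide whether to retry, log, or abort.
    """
    try:
        return scraper.fetch(url, metadata=metadata)
    except ValueError:
        # Documented as raised when the response is not valid HTML.
        return None
```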

validate_content_type

validate_content_type(response: httpx.Response) -> None

Validate that the HTTP response contains HTML content.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| response | Response | HTTP response returned by httpx. | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the Content-Type header is missing or does not indicate HTML content. |
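
The exact check is not shown here, so the following is only a sketch of how such a validation could work. It operates on a plain header mapping so that it runs without httpx. The function name `check_html_content_type` and the set of accepted media types are assumptions, not the library's actual behavior.

```python
from typing import Mapping, Optional

# Assumption: which media types count as "HTML" is not specified above.
HTML_MEDIA_TYPES = ("text/html", "application/xhtml+xml")


def check_html_content_type(headers: Mapping[str, str]) -> None:
    """Raise ValueError unless the Content-Type header indicates HTML."""
    raw: Optional[str] = None
    # Header names are case-insensitive, so compare them in lowercase.
    for name, value in headers.items():
        if name.lower() == "content-type":
            raw = value
            break
    if not raw:
        raise ValueError("Missing Content-Type header")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = raw.split(";", 1)[0].strip().lower()
    if media_type not in HTML_MEDIA_TYPES:
        raise ValueError(f"Expected HTML content, got {media_type!r}")
```

With a real `httpx.Response`, the same logic could be applied to `response.headers`, which already performs case-insensitive lookups.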