

omniread.html

HTML format implementation for OmniRead.


Summary

This package provides HTML-specific implementations of the core OmniRead contracts defined in omniread.core.

It includes:

- HTML parsers that interpret HTML content
- HTML scrapers that retrieve HTML documents

This package:

- Implements, but does not redefine, the core contracts
- May contain HTML-specific behavior and edge-case handling
- Produces the canonical content models defined in omniread.core.content

Consumers should depend on omniread.core interfaces wherever possible and use this package only when HTML-specific behavior is required.


Public API

    HTMLScraper
    HTMLParser

Classes

HTMLParser

HTMLParser(content: Content, features: str = 'html.parser')

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- Extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup
- Provides reusable helpers for HTML extraction; concrete parsers must explicitly define the return type

Guarantees:

- Accepts only HTML content
- Owns a parsed BeautifulSoup DOM tree
- Provides pure helper utilities for common HTML structures

Constraints:

- Concrete subclasses must define the output type `T` and implement the `parse()` method
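The subclassing contract above can be illustrated with a minimal, stdlib-only sketch. `Content`, `BaseParser`, and `TitleParser` here are simplified stand-ins invented for this example, not the real omniread classes (in particular, the real `HTMLParser` owns a BeautifulSoup DOM rather than using regexes):

```python
import re
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")


class Content:
    """Simplified stand-in for omniread.core's Content model."""

    def __init__(self, data: str):
        self.data = data


class BaseParser(ABC, Generic[T]):
    """Simplified stand-in for the core BaseParser contract."""

    def __init__(self, content: Content):
        self.content = content

    @abstractmethod
    def parse(self) -> T:
        """Concrete subclasses define T and implement parse()."""


class TitleParser(BaseParser[str]):
    """Hypothetical concrete parser: T is str, parse() returns the <title> text."""

    def parse(self) -> str:
        match = re.search(r"<title>(.*?)</title>", self.content.data, re.S)
        return match.group(1).strip() if match else ""


parser = TitleParser(Content("<html><title>Example</title></html>"))
print(parser.parse())  # Example
```

The point of the pattern is that the output type is pinned at subclass time (`BaseParser[str]`), so callers of `parse()` get a precise static type instead of a generic `T`.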

Initialize the HTML parser.

Parameters:

    content (Content): HTML content to be parsed. Required.
    features (str): BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). Default: 'html.parser'.

Raises:

    ValueError: If the content is empty or not valid HTML.

Attributes
supported_types (class attribute)
supported_types: set[ContentType] = {HTML}

Set of content types supported by this parser (HTML only).

Functions
parse (abstract method)
parse() -> T

Fully parse the HTML content into structured output.

Returns:

    T: Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a deterministic, structured output

parse_div (static method)
parse_div(div: Tag, *, separator: str = ' ') -> str

Extract normalized text from a <div> element.

Parameters:

    div (Tag): BeautifulSoup tag representing a <div>. Required.
    separator (str): String used to separate text nodes. Default: ' '.

Returns:

    str: Flattened, whitespace-normalized text content.
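The documented behavior — flatten text nodes, normalize whitespace, join with a separator — can be sketched with the stdlib alone. This mirrors what `parse_div` promises, not the actual implementation (which works on a BeautifulSoup `Tag`); `div_text` is a hypothetical helper:

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects stripped text nodes, mimicking flattened <div> extraction."""

    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def div_text(html: str, separator: str = " ") -> str:
    """Whitespace-normalized text of all nodes, joined by `separator`."""
    collector = _TextCollector()
    collector.feed(html)
    return separator.join(collector.chunks)


print(div_text("<div>  Hello <b>world</b>\n  again </div>"))  # Hello world again
```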

parse_link
parse_link(a: Tag) -> Optional[str]

Extract the hyperlink reference from an <a> element.

Parameters:

    a (Tag): BeautifulSoup tag representing an anchor. Required.

Returns:

    Optional[str]: The value of the href attribute, or None if absent.

parse_meta
parse_meta() -> dict[str, Any]

Extract high-level metadata from the HTML document.

Returns:

    dict[str, Any]: Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document
- This includes: the document title, and `<meta>` tag name/property → content mappings
parse_table (static method)
parse_table(table: Tag) -> list[list[str]]

Parse an HTML table into a 2D list of strings.

Parameters:

    table (Tag): BeautifulSoup tag representing a <table>. Required.

Returns:

    list[list[str]]: A list of rows, where each row is a list of cell text values.
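The row-of-cells structure described above can be sketched with a small stdlib parser that walks `<tr>`, `<td>`, and `<th>` tags. `table_rows` is a hypothetical helper mirroring the documented return shape, not the real `parse_table`:

```python
from html.parser import HTMLParser


class _TableCollector(HTMLParser):
    """Builds a 2D list of cell text from <tr>/<td>/<th> tags."""

    def __init__(self):
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell: list[str] | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._cell = []               # start collecting one cell's text

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html: str) -> list[list[str]]:
    """Parse an HTML table into rows of cell strings (hypothetical helper)."""
    c = _TableCollector()
    c.feed(html)
    return c.rows


print(table_rows("<table><tr><th>a</th><th>b</th></tr>"
                 "<tr><td>1</td><td>2</td></tr></table>"))
# [['a', 'b'], ['1', '2']]
```

Note that header (`<th>`) and data (`<td>`) cells both become plain strings, matching the documented `list[list[str]]` return type.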

supports
supports() -> bool

Check whether this parser supports the content's type.

Returns:

    bool: True if the content type is supported; False otherwise.

HTMLScraper

HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- Retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object
- Fetches raw bytes and metadata only: uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses

Initialize the HTML scraper.

Parameters:

    client (Optional[httpx.Client]): Optional pre-configured httpx.Client. If omitted, a client is created internally. Default: None.
    timeout (float): Request timeout in seconds. Default: 15.0.
    headers (Optional[Mapping[str, str]]): Optional default HTTP headers. Default: None.
    follow_redirects (bool): Whether to follow HTTP redirects. Default: True.
Functions
fetch
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch an HTML document from the given source.

Parameters:

    source (str): URL of the HTML document. Required.
    metadata (Optional[Mapping[str, Any]]): Optional metadata to be merged into the returned content. Default: None.

Returns:

    Content: A Content instance containing the raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

    HTTPError: If the HTTP request fails.
    ValueError: If the response is not valid HTML.

validate_content_type
validate_content_type(response: httpx.Response) -> None

Validate that the HTTP response contains HTML content.

Parameters:

    response (Response): HTTP response returned by httpx. Required.

Raises:

    ValueError: If the Content-Type header is missing or does not indicate HTML content.