Html
omniread.html
Summary
HTML format implementation for OmniRead.
This package provides HTML-specific implementations of the core OmniRead
contracts defined in omniread.core.
It includes:
- HTML parsers that interpret HTML content.
- HTML scrapers that retrieve HTML documents.
Key characteristics:
- Implements, but does not redefine, core contracts.
- May contain HTML-specific behavior and edge-case handling.
- Produces canonical content models defined in
omniread.core.content.
Consumers should depend on omniread.core interfaces wherever possible and
use this package only when HTML-specific behavior is required.
Public API
- HTMLScraper
- HTMLParser
Classes
HTMLParser
Bases: BaseParser[T], Generic[T]
Base HTML parser.
Notes
Responsibilities:
Guarantees:
Constraints:
Initialize the HTML parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | HTML content to be parsed. | required |
| features | str | BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). | 'html.parser' |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content is empty or not valid HTML. |
Attributes
supported_types
class-attribute
instance-attribute
Set of content types supported by this parser (HTML only).
Functions
parse
abstractmethod
Fully parse the HTML content into structured output.
Returns:
| Name | Type | Description |
|---|---|---|
| T | T | Parsed representation of type T. |
Notes
Responsibilities:
parse_div
staticmethod
Extract normalized text from a <div> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| div | Tag | BeautifulSoup tag representing a <div> element. | required |
| separator | str | String used to separate text nodes. | ' ' |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Flattened, whitespace-normalized text content. |
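The "flattened, whitespace-normalized" result can be illustrated with a small standard-library sketch. This is not OmniRead's implementation: `flatten_text` and `_TextCollector` are hypothetical names, and the sketch takes an HTML string instead of a BeautifulSoup tag so that it runs without third-party packages.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class _TextCollector(StdlibHTMLParser):
    """Collects the text nodes of an HTML fragment in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def flatten_text(html: str, separator: str = " ") -> str:
    """Join text nodes with `separator`, collapsing runs of whitespace.

    Hypothetical helper: it mirrors the documented behavior of
    parse_div, not its actual code.
    """
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Collapse whitespace inside each text node and skip empty nodes.
    pieces = [" ".join(chunk.split()) for chunk in collector.chunks]
    return separator.join(piece for piece in pieces if piece)


print(flatten_text("<div>  Hello\n <b>world</b> ! </div>"))
```

Passing a different `separator`, such as `"|"`, changes only the string placed between text nodes; whitespace inside each node is still collapsed.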
parse_link
staticmethod
Extract the hyperlink reference from an <a> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Tag | BeautifulSoup tag representing an anchor. | required |
Returns:
| Type | Description |
|---|---|
| Optional[str] | The value of the href attribute. |
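As a rough illustration of the `Optional[str]` return type, the following standard-library sketch reads the `href` of the first anchor in an HTML string. The names `first_href` and `_FirstAnchor` are hypothetical, and returning `None` when the attribute is absent is an assumption based on the return type, not confirmed behavior of `parse_link`.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Optional


class _FirstAnchor(StdlibHTMLParser):
    """Records the href of the first <a> start tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.found:
            self.found = True
            # attrs is a list of (name, value) pairs; a missing href yields None.
            self.href = dict(attrs).get("href")


def first_href(html: str) -> Optional[str]:
    """Hypothetical helper: href of the first anchor, or None (assumed)."""
    finder = _FirstAnchor()
    finder.feed(html)
    finder.close()
    return finder.href


print(first_href('<a href="https://example.com/docs">Docs</a>'))
print(first_href('<a name="top">Top</a>'))
```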
parse_meta
Extract high-level metadata from the HTML document.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing extracted metadata. |
Notes
Responsibilities:
parse_table
staticmethod
Parse an HTML table into a 2D list of strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | Tag | BeautifulSoup tag representing a <table> element. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | A list of rows, where each row is a list of cell text values. |
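The row-by-row shape of the result can be sketched with the standard library alone. This is an illustration of the output format, not the code behind `parse_table`: `table_to_rows` and `_TableCollector` are hypothetical names, the input is an HTML string instead of a BeautifulSoup tag, and treating `<th>` cells like `<td>` cells is an assumption.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Optional


class _TableCollector(StdlibHTMLParser):
    """Builds a list of rows, each row a list of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell: Optional[list[str]] = None  # text pieces of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Normalize whitespace so each cell is a single clean string.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def table_to_rows(html: str) -> list[list[str]]:
    """Hypothetical helper that returns a table as a 2D list of strings."""
    collector = _TableCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


sample = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td> Apple </td><td>3</td></tr></table>"
)
print(table_to_rows(sample))
```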
supports
Check whether this parser supports the content's type.
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the content type is supported; False otherwise. |
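The relationship between `supported_types` and `supports` can be pictured as a set-membership check. Everything in the sketch below is a stand-in: `SketchContentType`, `SketchContent`, and `SketchParser` are hypothetical and do not reflect the real models in omniread.core.content, whose fields are not shown on this page.

```python
from dataclasses import dataclass
from enum import Enum


class SketchContentType(Enum):
    """Hypothetical content-type enum, for illustration only."""

    HTML = "text/html"
    JSON = "application/json"


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for a content model."""

    raw: bytes
    content_type: SketchContentType


class SketchParser:
    # Class-level set of accepted types (HTML only, as the docs describe).
    supported_types = {SketchContentType.HTML}

    def __init__(self, content: SketchContent) -> None:
        self.content = content

    def supports(self) -> bool:
        # Assumed logic: membership test against supported_types.
        return self.content.content_type in self.supported_types


html_doc = SketchContent(raw=b"<p>hi</p>", content_type=SketchContentType.HTML)
print(SketchParser(html_doc).supports())
```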
HTMLScraper
Bases: BaseScraper
Base HTML scraper using httpx.
Notes
Responsibilities:
Constraints:
Initialize the HTML scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Client \| None | Optional pre-configured httpx client. | None |
| timeout | float | Request timeout in seconds. | 15.0 |
| headers | Optional[Mapping[str, str]] | Optional default HTTP headers. | None |
| follow_redirects | bool | Whether to follow HTTP redirects. | True |
Functions
fetch
Fetch an HTML document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | URL of the HTML document. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to be merged into the returned content. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| Content | Content | A Content object for the fetched HTML document. |
Raises:
| Type | Description |
|---|---|
| HTTPError | If the HTTP request fails. |
| ValueError | If the response is not valid HTML. |
validate_content_type
Validate that the HTTP response contains HTML content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| response | Response | HTTP response returned by httpx. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the response's content type is not HTML. |
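A check of this kind usually compares the media type of the `Content-Type` header while ignoring parameters such as `charset`. The sketch below is an assumption about the general approach, not the method's actual code: `ensure_html` is a hypothetical helper that works on the raw header string, and accepting only `text/html` is a simplification.

```python
def ensure_html(content_type_header: str) -> None:
    """Raise ValueError unless the header's media type is text/html.

    Hypothetical helper; the media types the real method accepts are
    not documented on this page.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type != "text/html":
        raise ValueError(f"Expected HTML content, got {media_type or 'no content type'!r}")


ensure_html("text/html; charset=utf-8")  # passes silently

try:
    ensure_html("application/json")
except ValueError as exc:
    print(exc)
```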