Parser
omniread.html.parser
Summary
HTML parser base implementations for OmniRead.
This module provides reusable HTML parsing utilities built on top of
the abstract parser contracts defined in omniread.core.parser.
It supplies:
- Content-type enforcement for HTML inputs
- BeautifulSoup initialization and lifecycle management
- Common helper methods for extracting structured data from HTML elements
Concrete parsers must subclass HTMLParser and implement the parse() method
to return a structured representation appropriate for their use case.
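The subclassing contract described above can be sketched as follows. This is a minimal, stdlib-only illustration and does not import omniread: `FakeContent`, `SketchHTMLParser`, `TitleParser`, and the `"text/html"` value in `supported_types` are all hypothetical stand-ins, and the real `HTMLParser` builds a BeautifulSoup tree rather than using a regular expression.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class FakeContent:
    """Hypothetical stand-in for the Content object passed to the parser."""
    raw: bytes
    content_type: str


class SketchHTMLParser(ABC, Generic[T]):
    """Stand-in base class: enforces HTML input and leaves parse() abstract."""

    # Assumed value, for illustration only.
    supported_types = {"text/html"}

    def __init__(self, content: FakeContent) -> None:
        self.content = content
        if not self.supports():
            raise ValueError(f"unsupported content type: {content.content_type}")
        if not content.raw.strip():
            raise ValueError("empty HTML content")

    def supports(self) -> bool:
        # True only when the content type is one this parser accepts.
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        """Return a structured representation chosen by the subclass."""


class TitleParser(SketchHTMLParser[str]):
    """Concrete parser whose structured output is just the page title."""

    def parse(self) -> str:
        text = self.content.raw.decode("utf-8")
        match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else ""


page = FakeContent(b"<html><title> Hi </title></html>", "text/html")
print(TitleParser(page).parse())
```

The type parameter fixes the return type of `parse()` per subclass, so a parser that returns a list of rows and one that returns a single string can share the same base.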
Classes
HTMLParser
Bases: BaseParser[T], Generic[T]
Base HTML parser.
Notes
Responsibilities:
Guarantees:
Constraints:
Initialize the HTML parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | HTML content to be parsed. | required |
| features | str | BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). | 'html.parser' |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content is empty or not valid HTML. |
Attributes
supported_types
class-attribute
instance-attribute
Set of content types supported by this parser (HTML only).
Functions
parse
abstractmethod
Fully parse the HTML content into structured output.
Returns:
| Name | Type | Description |
|---|---|---|
| T | T | Parsed representation of type T. |
Notes
Responsibilities:
parse_div
staticmethod
Extract normalized text from a <div> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| div | Tag | BeautifulSoup tag representing a <div> element. | required |
| separator | str | String used to separate text nodes. | ' ' |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Flattened, whitespace-normalized text content. |
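The flattening and whitespace normalization described for parse_div can be illustrated with a stdlib-only sketch. It works on an HTML string rather than a BeautifulSoup `Tag`, and `flatten_text` and `_TextCollector` are hypothetical names, so treat it as an approximation of the behavior rather than the library's implementation.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class _TextCollector(StdlibHTMLParser):
    """Collects non-empty text nodes, collapsing whitespace inside each one."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def flatten_text(fragment: str, separator: str = " ") -> str:
    """Join the text nodes of an HTML fragment with the given separator."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return separator.join(collector.chunks)


print(flatten_text("<div>  Hello <b>big</b>\n  world </div>"))
print(flatten_text("<div>one<span>two</span></div>", separator=" | "))
```

The `separator` argument plays the same role as in the parameter table: it is placed between text nodes, not between individual words.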
parse_link
staticmethod
Extract the hyperlink reference from an <a> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Tag | BeautifulSoup tag representing an anchor. | required |
Returns:
| Type | Description |
|---|---|
| Optional[str] | The value of the href attribute, if present. |
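The `Optional[str]` return type suggests that an anchor without a link target yields `None`. A stdlib-only sketch of that behavior, assuming the hyperlink reference is the `href` attribute, is shown below; `first_href` and `_FirstAnchor` are hypothetical names, and the function takes an HTML string instead of a BeautifulSoup `Tag`.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Optional


class _FirstAnchor(StdlibHTMLParser):
    """Records the href of the first <a> start tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.seen = False
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "a" and not self.seen:
            self.seen = True
            # attrs is a list of (name, value) pairs; a missing href gives None.
            self.href = dict(attrs).get("href")


def first_href(fragment: str) -> Optional[str]:
    """Return the href of the first anchor in the fragment, or None."""
    finder = _FirstAnchor()
    finder.feed(fragment)
    finder.close()
    return finder.href


print(first_href('<a href="/docs">Docs</a>'))
print(first_href('<a name="top">Top</a>'))
```

Callers should check for `None` before using the result, for example before resolving it against a base URL.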
parse_meta
Extract high-level metadata from the HTML document.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing extracted metadata. |
Notes
Responsibilities:
parse_table
staticmethod
Parse an HTML table into a 2D list of strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | Tag | BeautifulSoup tag representing a <table> element. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | A list of rows, where each row is a list of cell text values. |
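The row-and-cell shape of the result can be illustrated with a stdlib-only sketch. It parses an HTML string rather than a BeautifulSoup `Tag`, treats `<th>` and `<td>` alike, and ignores nested tables; `table_rows` and `_TableCollector` are hypothetical names, and the real method may handle these cases differently.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class _TableCollector(StdlibHTMLParser):
    """Builds a list of rows, each a list of whitespace-normalized cell texts."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data: str) -> None:
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag) -> None:
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(fragment: str) -> list:
    """Return the table in the fragment as a 2D list of strings."""
    collector = _TableCollector()
    collector.feed(fragment)
    collector.close()
    return collector.rows


html = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td> Apple </td><td>3</td></tr></table>"
)
print(table_rows(html))
```

Every value is a string, so numeric columns such as `Qty` need an explicit conversion by the caller.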
supports
Check whether this parser supports the content's type.
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the content type is supported; False otherwise. |