Parser

omniread.html.parser

Summary

HTML parser base implementations for OmniRead.

This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.

It supplies:

  • Content-type enforcement for HTML inputs
  • BeautifulSoup initialization and lifecycle management
  • Common helper methods for extracting structured data from HTML elements

Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.
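The subclassing contract described above can be sketched without OmniRead installed. In this minimal sketch, `SketchHTMLParser` and `TitleParser` are hypothetical stand-ins, not the library's classes: the stand-in takes a plain string instead of a `Content` object and uses a regular expression instead of BeautifulSoup. It only illustrates how a concrete parser fixes the output type `T` and implements `parse()`.

```python
import re
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")


class SketchHTMLParser(ABC, Generic[T]):
    """Hypothetical stand-in for the HTMLParser base class (assumption)."""

    def __init__(self, html: str) -> None:
        # The real constructor takes a Content object; a str keeps this runnable.
        self.html = html

    @abstractmethod
    def parse(self) -> T:
        """Each subclass returns its own structured type T."""


class TitleParser(SketchHTMLParser[str]):
    """Concrete parser that fixes T to str."""

    def parse(self) -> str:
        # Regex lookup is a simplification used only for this sketch.
        match = re.search(
            r"<title>(.*?)</title>", self.html, re.IGNORECASE | re.DOTALL
        )
        return match.group(1).strip() if match else ""


title = TitleParser("<html><head><title> Hello </title></head></html>").parse()
```

Because `parse()` is abstract, instantiating the base class directly fails with `TypeError`, which is how the "must implement" rule is enforced in this sketch.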

Classes

HTMLParser

HTMLParser(content: Content, features: str = 'html.parser')

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- Extends the core `BaseParser` with HTML-specific behavior: DOM parsing
  via BeautifulSoup and reusable extraction helpers.
- Leaves the return type open. Concrete parsers must explicitly define it.

Guarantees:

- Accepts only HTML content.
- Owns a parsed BeautifulSoup DOM tree.
- Provides pure helper utilities for common HTML structures.

Constraints:

- Concrete subclasses must define the output type `T` and implement
  the `parse()` method.

Initialize the HTML parser.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| content | Content | HTML content to be parsed. | required |
| features | str | BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). | 'html.parser' |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the content is empty or not valid HTML. |

Attributes
supported_types class-attribute instance-attribute
supported_types: set[ContentType] = {HTML}

Set of content types supported by this parser (HTML only).

Functions
parse abstractmethod
parse() -> T

Fully parse the HTML content into structured output.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| T | T | Parsed representation of type T. |

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a
  deterministic, structured output.
parse_div staticmethod
parse_div(div: Tag, *, separator: str = ' ') -> str

Extract normalized text from a <div> element.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| div | Tag | BeautifulSoup tag representing a <div>. | required |
| separator | str | String used to separate text nodes. | ' ' |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | Flattened, whitespace-normalized text content. |
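One plausible reading of "flattened, whitespace-normalized" is sketched below. The function `normalize_text` is hypothetical and works on plain strings rather than a BeautifulSoup `Tag`. The exact rules `parse_div` applies are not stated here, so the collapsing and empty-fragment handling are assumptions.

```python
def normalize_text(fragments, separator=" "):
    """Join text fragments after collapsing whitespace (illustrative only)."""
    # Collapse runs of whitespace inside each fragment and trim the ends.
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    # Drop fragments that were whitespace-only, then join with the separator.
    return separator.join(piece for piece in cleaned if piece)


# Stand-in for the text nodes found inside a <div>.
fragments = ["  Hello\n", "   ", "big   world "]
flat = normalize_text(fragments)
piped = normalize_text(fragments, separator=" | ")
```

The `separator` argument mirrors the keyword-only parameter of `parse_div`: it changes only how the text nodes are joined, not how each node is cleaned.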

parse_link
parse_link(a: Tag) -> Optional[str]

Extract the hyperlink reference from an <a> element.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| a | Tag | BeautifulSoup tag representing an anchor. | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[str] | The value of the href attribute, or None if absent. |
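The "href or None" behavior can be illustrated with the standard library alone. This sketch uses Python's built-in `html.parser` instead of BeautifulSoup so it runs without extra dependencies. `FirstAnchor` and `first_href` are hypothetical names, and the real helper receives an already-parsed `Tag` rather than a string.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Optional


class FirstAnchor(StdlibHTMLParser):
    """Records the href of the first <a> element it sees (illustrative)."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None
        self.seen = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.seen:
            self.seen = True
            # dict.get returns None when the anchor has no href attribute.
            self.href = dict(attrs).get("href")


def first_href(html: str) -> Optional[str]:
    collector = FirstAnchor()
    collector.feed(html)
    collector.close()
    return collector.href


with_href = first_href('<a href="/docs">Docs</a>')
without_href = first_href('<a name="top">Top</a>')
```

Returning `None` instead of raising lets callers treat anchors without a target as ordinary, skippable data.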

parse_meta
parse_meta() -> dict[str, Any]

Extract high-level metadata from the HTML document.

Returns:

| Type | Description |
| --- | --- |
| dict[str, Any] | Dictionary containing extracted metadata. |

Notes

Responsibilities:

- Extract high-level metadata from the HTML document.
- This includes the document title and a mapping from each `<meta>`
  tag's name or property to its content.
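A rough sketch of the metadata described above, built on the standard-library `html.parser` rather than BeautifulSoup. The names `MetaCollector` and `extract_meta` are hypothetical, and so is the shape of the returned dictionary (the `"title"` and `"meta"` keys). The source specifies only that the title and the `<meta>` name/property-to-content mappings are included.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Any


class MetaCollector(StdlibHTMLParser):
    """Collects the <title> text and <meta> name/property mappings."""

    def __init__(self) -> None:
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Prefer name, fall back to property (e.g. Open Graph tags).
            key = attributes.get("name") or attributes.get("property")
            content = attributes.get("content")
            if key and content is not None:
                self.meta[key] = content

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            self.title = "".join(self._title_parts).strip()


def extract_meta(html: str) -> dict[str, Any]:
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    # The key layout here is an assumption made for this sketch.
    return {"title": collector.title, "meta": collector.meta}


page = (
    "<html><head><title>Example</title>"
    '<meta name="description" content="A demo page">'
    '<meta property="og:type" content="article">'
    "</head></html>"
)
metadata = extract_meta(page)
```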
parse_table staticmethod
parse_table(table: Tag) -> list[list[str]]

Parse an HTML table into a 2D list of strings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| table | Tag | BeautifulSoup tag representing a <table>. | required |

Returns:

| Type | Description |
| --- | --- |
| list[list[str]] | A list of rows, where each row is a list of cell text values. |
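The row-by-row shape of the result can be shown with a small standard-library sketch. `TableCollector` and `table_to_rows` are hypothetical names. The sketch uses `html.parser` instead of BeautifulSoup, treats `<th>` and `<td>` alike, and does not handle nested tables. Whether `parse_table` behaves the same way in those cases is not stated here.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class TableCollector(StdlibHTMLParser):
    """Builds a 2D list of cell text from <tr>/<th>/<td> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse whitespace so each cell is a clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_rows(html: str) -> list[list[str]]:
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


rows = table_to_rows(
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td> Apple </td><td>3</td></tr></table>"
)
```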

supports
supports() -> bool

Check whether this parser supports the content's type.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the content type is supported; False otherwise. |
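The check amounts to a set-membership test against `supported_types`. The sketch below uses hypothetical stand-ins for `ContentType`, `Content`, and the parser class. The enum values and the `content_type` field name are assumptions, since the real definitions live elsewhere in OmniRead.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    """Hypothetical stand-in for OmniRead's ContentType."""

    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    """Hypothetical stand-in: raw payload plus its declared type."""

    raw: bytes
    content_type: ContentType


class SketchParser:
    # Mirrors the documented attribute: HTML is the only supported type.
    supported_types = {ContentType.HTML}

    def __init__(self, content: Content) -> None:
        self.content = content

    def supports(self) -> bool:
        # Membership test against the class-level set.
        return self.content.content_type in self.supported_types


html_ok = SketchParser(Content(b"<p>hi</p>", ContentType.HTML)).supports()
json_ok = SketchParser(Content(b"{}", ContentType.JSON)).supports()
```

Keeping the supported types in a class-level set means a subclass could widen or narrow them by overriding one attribute, without touching `supports()`.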