Parser

omniread.html.parser

Summary

HTML parser base implementations for OmniRead.

This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.

It supplies:

Content-type enforcement for HTML inputs
BeautifulSoup initialization and lifecycle management
Common helper methods for extracting structured data from HTML elements

Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.

Classes

HTMLParser

HTMLParser(content: Content, features: str = 'html.parser')

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- Extends the core `BaseParser` with HTML-specific behavior: DOM parsing
  via BeautifulSoup and reusable extraction helpers.
- Leaves the return type open. Concrete parsers must define it
  explicitly.

Guarantees:

- Accepts only HTML content.
- Owns a parsed BeautifulSoup DOM tree.
- Provides pure helper utilities for common HTML structures.

Constraints:

- Concrete subclasses must define the output type `T` and implement
  the `parse()` method.

Initialize the HTML parser.

Parameters:

Name	Type	Description	Default
`content`	`Content`	HTML content to be parsed.	required
`features`	`str`	BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').	`'html.parser'`

Raises:

Type	Description
`ValueError`	If the content is empty or not valid HTML.
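
The contract described above (HTML-only content, `ValueError` on bad input, a subclass-defined output type `T`) can be sketched without OmniRead installed. Everything in this block is a hypothetical stand-in rather than the library's code: `ContentType`, `Content` and its `raw`/`content_type` fields, `SketchParser`, and `ByteCountParser` are invented for illustration, and the checks inside `__init__` are an assumption about how enforcement might work. The stand-in also leaves out BeautifulSoup so that it runs on the standard library alone.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    """Hypothetical stand-in for OmniRead's content-type enum."""
    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    """Hypothetical stand-in; the real field names may differ."""
    raw: bytes
    content_type: ContentType


class SketchParser(ABC, Generic[T]):
    """Stand-in base class mirroring the documented contract."""

    supported_types: set[ContentType] = {ContentType.HTML}

    def __init__(self, content: Content) -> None:
        self.content = content
        # Assumed enforcement: reject unsupported types and empty input.
        if not self.supports():
            raise ValueError(f"unsupported content type: {content.content_type}")
        if not content.raw.strip():
            raise ValueError("content is empty")

    def supports(self) -> bool:
        # True only when the content's type is in supported_types.
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class ByteCountParser(SketchParser[int]):
    """Concrete subclass: fixes T to int and implements parse()."""

    def parse(self) -> int:
        return len(self.content.raw)


page = Content(raw=b"<p>hi</p>", content_type=ContentType.HTML)
parser = ByteCountParser(page)
byte_count = parser.parse()

# Non-HTML content is rejected at construction time in this sketch.
try:
    ByteCountParser(Content(raw=b"{}", content_type=ContentType.JSON))
    rejected = False
except ValueError:
    rejected = True
```

Binding `T` in the class header (`SketchParser[int]`) is what lets a type checker verify that `parse()` returns the declared type.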

Attributes

supported_types `class-attribute` `instance-attribute`

supported_types: set[ContentType] = {HTML}

Set of content types supported by this parser (HTML only).

Functions

parse `abstractmethod`

parse() -> T

Fully parse the HTML content into structured output.

Returns:

Name	Type	Description
`T`	`T`	Parsed representation of type `T`.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a
  deterministic, structured output.

parse_div `staticmethod`

parse_div(div: Tag, *, separator: str = ' ') -> str

Extract normalized text from a `<div>` element.

Parameters:

Name	Type	Description	Default
`div`	`Tag`	BeautifulSoup tag representing a `<div>`.	required
`separator`	`str`	String used to separate text nodes.	`' '`

Returns:

Name	Type	Description
`str`	`str`	Flattened, whitespace-normalized text content.
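
A small sketch of what this helper's output could look like, using BeautifulSoup directly. The `div_text` function is a hypothetical re-creation, not OmniRead's implementation. It assumes that "flattened, whitespace-normalized" means joining the stripped text nodes with `separator`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def div_text(div: Tag, *, separator: str = " ") -> str:
    """Hypothetical equivalent of parse_div: join stripped text nodes."""
    # strip=True trims each text node and drops whitespace-only nodes.
    return div.get_text(separator=separator, strip=True)


soup = BeautifulSoup("<div> Hello <b>world</b>\n <i>again</i> </div>", "html.parser")
div = soup.find("div")

spaced = div_text(div)                  # default single-space separator
piped = div_text(div, separator=" | ")  # custom separator
```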

parse_link `staticmethod`

parse_link(a: Tag) -> Optional[str]

Extract the hyperlink reference from an `<a>` element.

Parameters:

Name	Type	Description	Default
`a`	`Tag`	BeautifulSoup tag representing an anchor.	required

Returns:

Type	Description
`Optional[str]`	The value of the `href` attribute, or `None` if absent.
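
The `None`-when-absent behavior can be illustrated with BeautifulSoup's `Tag.get`. The `link_href` function is a hypothetical stand-in for `parse_link`; the real helper may normalize or validate the value differently.

```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag


def link_href(a: Tag) -> Optional[str]:
    """Hypothetical equivalent of parse_link."""
    # Tag.get returns None when the attribute is missing.
    return a.get("href")


soup = BeautifulSoup('<a href="/docs">Docs</a><a name="top">Top</a>', "html.parser")
hrefs = [link_href(a) for a in soup.find_all("a")]
```

The second anchor has no `href`, so its entry is `None` rather than an empty string.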

parse_meta

parse_meta() -> dict[str, Any]

Extract high-level metadata from the HTML document.

Returns:

Type	Description
`dict[str, Any]`	Dictionary containing the extracted metadata.

Notes

Responsibilities:

- The extracted metadata includes the document title and a mapping from
  each `<meta>` tag's name/property to its content.
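
One way such metadata extraction could work is sketched below. The function name `extract_meta` and the returned key layout (`"title"` and `"meta"`) are assumptions made for illustration; this page does not specify the exact dictionary shape that `parse_meta` returns.

```python
from typing import Any

from bs4 import BeautifulSoup


def extract_meta(soup: BeautifulSoup) -> dict[str, Any]:
    """Hypothetical metadata extraction; the key layout is assumed."""
    title = None
    if soup.title and soup.title.string:
        title = soup.title.string.strip()

    meta: dict[str, str] = {}
    for tag in soup.find_all("meta"):
        # Standard tags use `name`; Open Graph style tags use `property`.
        key = tag.get("name") or tag.get("property")
        value = tag.get("content")
        if key and value is not None:
            meta[key] = value

    return {"title": title, "meta": meta}


html = """
<html><head>
  <title> Example page </title>
  <meta name="description" content="A short demo">
  <meta property="og:type" content="article">
  <meta charset="utf-8">
</head><body></body></html>
"""
metadata = extract_meta(BeautifulSoup(html, "html.parser"))
```

In this sketch the `charset` tag is skipped because it carries neither a `name` nor a `property` attribute.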

parse_table `staticmethod`

parse_table(table: Tag) -> list[list[str]]

Parse an HTML table into a 2D list of strings.

Parameters:

Name	Type	Description	Default
`table`	`Tag`	BeautifulSoup tag representing a `<table>`.	required

Returns:

Type	Description
`list[list[str]]`	A list of rows, where each row is a list of cell text values.
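
The row-and-cell structure can be sketched with plain BeautifulSoup calls. The `table_rows` function is a hypothetical re-creation of `parse_table`. It assumes that header cells (`<th>`) are included alongside data cells and that cell text is stripped, neither of which is confirmed on this page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def table_rows(table: Tag) -> list[list[str]]:
    """Hypothetical equivalent of parse_table."""
    rows: list[list[str]] = []
    for tr in table.find_all("tr"):
        # Collect header and data cells in document order.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


html = """
<table>
  <tr><th>Name</th><th>Role</th></tr>
  <tr><td> Ada </td><td>Engineer</td></tr>
  <tr><td>Lin</td><td></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
grid = table_rows(table)
```

Empty cells come back as empty strings, so every row keeps the same length. Because `find_all` searches recursively, this sketch would also pick up rows from nested tables.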

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

Name	Type	Description
`bool`	`bool`	True if the content type is supported; False otherwise.
