omniread
OmniRead — format-agnostic content acquisition and parsing framework.
Summary
OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.
The library is structured around three core concepts:
- Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
- Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
- Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior
Installation
Install OmniRead using pip:
```shell
pip install omniread
```
Or with Poetry:
```shell
poetry add omniread
```
Quick start
HTML example (an illustrative sketch; the `MetaParser` subclass is an assumption, since `HTMLParser` is abstract and must be subclassed):

```python
from omniread import Content, HTMLScraper, HTMLParser

# HTMLParser is abstract: provide a concrete parse() implementation.
class MetaParser(HTMLParser[dict]):
    def parse(self) -> dict:
        return self.parse_meta()

scraper = HTMLScraper(timeout=15.0)
content: Content = scraper.fetch("https://example.com/article.html")
meta = MetaParser(content).parse()
```
PDF example (an illustrative sketch; the `PageCountParser` subclass and its body are assumptions, since `PDFParser` is abstract and must be subclassed):

```python
from pathlib import Path
from omniread import Content, FileSystemPDFClient, PDFScraper, PDFParser

# PDFParser is abstract: provide a concrete parse() implementation.
class PageCountParser(PDFParser[int]):
    def parse(self) -> int:
        ...  # interpret the raw PDF bytes and return a page count

client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content: Content = scraper.fetch(Path("report.pdf"))
pages = PageCountParser(content).parse()
```
Public API
This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.
Core:
- Content
- ContentType

HTML:
- HTMLScraper
- HTMLParser

PDF:
- FileSystemPDFClient
- PDFScraper
- PDFParser
Core Philosophy:
OmniRead is designed as a decoupled content engine:
1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
2. Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.
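The normalized-exchange contract can be sketched in plain Python. The class names below mirror the documented API, but the `data` field name and the method bodies are illustrative stand-ins, not the library's actual implementation:

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

@dataclass
class Content:
    """Stand-in for omniread's Content: raw bytes plus minimal context."""
    source: str                          # URL, file path, or logical name
    data: bytes                          # raw bytes ("data" is an assumed field name)
    content_type: Optional[str] = None   # MIME type, if known
    metadata: Optional[Mapping[str, Any]] = None

class Scraper:
    """Scrapers acquire bytes from a source; they never interpret them."""
    def fetch(self, source: str) -> Content:
        raise NotImplementedError

class Parser:
    """Parsers interpret a Content instance; they never perform IO."""
    supported_types = {"text/html"}

    def __init__(self, content: Content) -> None:
        self.content = content

    def supports(self) -> bool:
        # Accept content only when its declared type matches.
        return self.content.content_type in self.supported_types
```

Because both sides speak only `Content`, a parser implementation can be swapped per format without touching the scraper.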
Classes
Content
dataclass
Normalized representation of extracted content.
Attributes
content_type
class-attribute
instance-attribute
Optional MIME type of the content, if known.
metadata
class-attribute
instance-attribute
Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
source
instance-attribute
Identifier of the content origin (URL, file path, or logical name).
ContentType
Bases: str, Enum
Supported MIME types for extracted content.
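Because `ContentType` subclasses both `str` and `Enum`, its members compare equal to plain MIME-type strings. A minimal sketch (the member names and values here are assumptions based on the HTML and PDF support described above):

```python
from enum import Enum

class ContentType(str, Enum):
    # Illustrative members; the library's actual members may differ.
    HTML = "text/html"
    PDF = "application/pdf"

# str subclassing lets members interoperate with plain strings:
is_html = ContentType.HTML == "text/html"        # True
by_value = ContentType("application/pdf")        # ContentType.PDF
```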
FileSystemPDFClient
Bases: BasePDFClient
PDF client that reads from the local filesystem.
Functions
fetch
Read a PDF file from the local filesystem.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Filesystem path to the PDF file. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `bytes` | `bytes` | Raw PDF bytes. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the path does not exist. |
| `ValueError` | If the path exists but is not a file. |
HTMLParser
Bases: BaseParser[T], Generic[T]
Base HTML parser.
Initialize the HTML parser.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `content` | `Content` | HTML content to be parsed. | *required* |
| `features` | `str` | BeautifulSoup parser backend to use (e.g., `'html.parser'`, `'lxml'`). | `'html.parser'` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the content is empty or not valid HTML. |
Attributes
supported_types
class-attribute
instance-attribute
Set of content types supported by this parser (HTML only).
Functions
parse
abstractmethod
Fully parse the HTML content into structured output.
Returns:

| Name | Type | Description |
|---|---|---|
| `T` | `T` | Parsed representation of type `T`. |
parse_div
staticmethod
Extract normalized text from a <div> element.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `div` | `Tag` | BeautifulSoup tag representing a `<div>` element. | *required* |
| `separator` | `str` | String used to separate text nodes. | `' '` |

Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | Flattened, whitespace-normalized text content. |
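The flatten-and-normalize behavior can be approximated with the standard library (the real method operates on a BeautifulSoup `Tag`; this stand-in only illustrates the contract):

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect non-empty text nodes from an HTML fragment."""
    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())

def div_text(html: str, separator: str = " ") -> str:
    """Flatten text nodes, trimming surrounding whitespace."""
    collector = _TextCollector()
    collector.feed(html)
    return separator.join(collector.chunks)
```

For example, `div_text("<div> Hello <b>world</b> </div>")` yields `"Hello world"`.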
parse_link
staticmethod
Extract the hyperlink reference from an <a> element.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `Tag` | BeautifulSoup tag representing an anchor. | *required* |

Returns:

| Type | Description |
|---|---|
| `Optional[str]` | The value of the `href` attribute, or `None` if the anchor has none. |
parse_meta
Extract high-level metadata from the HTML document.
Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary containing extracted metadata. |
parse_table
staticmethod
Parse an HTML table into a 2D list of strings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `Tag` | BeautifulSoup tag representing a `<table>` element. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[list[str]]` | A list of rows, where each row is a list of cell text values. |
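The row-and-cell contract can be sketched with the standard library (the real method operates on a BeautifulSoup `Tag`; this stand-in only illustrates the output shape):

```python
from html.parser import HTMLParser
from typing import Optional

class _TableCollector(HTMLParser):
    """Collect <tr>/<td>/<th> cell text into a 2D list of strings."""
    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell: Optional[list[str]] = None

    def handle_starttag(self, tag: str, attrs) -> None:
        if tag == "tr":
            self.rows.append([])        # start a new row
        elif tag in ("td", "th"):
            self._cell = []             # start collecting a cell

    def handle_endtag(self, tag: str) -> None:
        if tag in ("td", "th") and self._cell is not None and self.rows:
            # Join the cell's text fragments and normalize whitespace.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_data(self, data: str) -> None:
        if self._cell is not None:
            self._cell.append(data)

def table_to_rows(html: str) -> list[list[str]]:
    collector = _TableCollector()
    collector.feed(html)
    return collector.rows
```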
supports
Check whether this parser supports the content's type.
Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | `True` if the content type is supported; `False` otherwise. |
HTMLScraper
Bases: BaseScraper
Base HTML scraper using httpx.
Initialize the HTML scraper.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `client` | `Client \| None` | Optional pre-configured `httpx.Client` instance. | `None` |
| `timeout` | `float` | Request timeout in seconds. | `15.0` |
| `headers` | `Optional[Mapping[str, str]]` | Optional default HTTP headers. | `None` |
| `follow_redirects` | `bool` | Whether to follow HTTP redirects. | `True` |
Functions
fetch
Fetch an HTML document from the given source.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str` | URL of the HTML document. | *required* |
| `metadata` | `Optional[Mapping[str, Any]]` | Optional metadata to be merged into the returned content. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `Content` | `Content` | A `Content` instance containing the fetched HTML. |

Raises:

| Type | Description |
|---|---|
| `HTTPError` | If the HTTP request fails. |
| `ValueError` | If the response is not valid HTML. |
validate_content_type
Validate that the HTTP response contains HTML content.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `response` | `Response` | HTTP response returned by `httpx`. | *required* |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the `Content-Type` header does not indicate HTML. |
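The validation step amounts to inspecting the response's `Content-Type` header. A stand-in using a plain header mapping (the real method takes an `httpx.Response`):

```python
def check_html_content_type(headers: dict[str, str]) -> None:
    """Raise ValueError unless the Content-Type header indicates HTML."""
    content_type = headers.get("content-type", "")
    # Parameters such as "; charset=utf-8" are ignored for the check.
    if content_type.split(";")[0].strip().lower() != "text/html":
        raise ValueError(f"Expected HTML, got {content_type!r}")
```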
PDFParser
Bases: BaseParser[T], Generic[T]
Base PDF parser.
Initialize the parser with content to be parsed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `content` | `Content` | `Content` instance to be parsed. | *required* |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the content type is not supported by this parser. |
Attributes
supported_types
class-attribute
instance-attribute
Set of content types supported by this parser (PDF only).
Functions
parse
abstractmethod
Parse PDF content into a structured output.
Returns:

| Name | Type | Description |
|---|---|---|
| `T` | `T` | Parsed representation of type `T`. |

Raises:

| Type | Description |
|---|---|
| `Exception` | Parsing-specific errors as defined by the implementation. |
supports
Check whether this parser supports the content's type.
Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | `True` if the content type is supported; `False` otherwise. |
PDFScraper
Bases: BaseScraper
Scraper for PDF sources.
Initialize the PDF scraper.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `client` | `BasePDFClient` | PDF client responsible for retrieving raw PDF bytes. | *required* |
Functions
fetch
Fetch a PDF document from the given source.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Any` | Identifier of the PDF source as understood by the configured PDF client. | *required* |
| `metadata` | `Optional[Mapping[str, Any]]` | Optional metadata to attach to the returned content. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `Content` | `Content` | A `Content` instance containing the fetched PDF. |

Raises:

| Type | Description |
|---|---|
| `Exception` | Retrieval-specific errors raised by the PDF client. |
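The fetch flow reduces to delegating retrieval to the injected client and wrapping the bytes in a `Content` record. A stand-in sketch (the local `Content` class, its `data` field, and the `application/pdf` tagging are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

@dataclass
class Content:  # stand-in; "data" is an assumed field name
    source: str
    data: bytes
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None

class PDFScraper:
    """Delegates retrieval to the injected client; never parses bytes."""
    def __init__(self, client) -> None:
        self.client = client  # any object exposing fetch(source) -> bytes

    def fetch(self, source: Any,
              metadata: Optional[Mapping[str, Any]] = None) -> Content:
        raw = self.client.fetch(source)  # retrieval errors propagate from the client
        return Content(source=str(source), data=raw,
                       content_type="application/pdf", metadata=metadata)
```

Because the client is injected, swapping the filesystem client for an object-storage client changes retrieval without touching the scraper.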