omniread

Summary

OmniRead — format-agnostic content acquisition and parsing framework.

OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

  1. Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
  2. Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
  3. Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure:

  • Clear boundaries between IO and interpretation.
  • Replaceable implementations per format.
  • Predictable, testable behavior.
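This contract can be sketched as a minimal, self-contained stand-in (stdlib only; the class names mirror the concepts above for illustration and are not the library's actual implementations):

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Content:
    """Format-agnostic exchange model: raw bytes plus minimal context."""
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


class Scraper:
    """Acquires bytes from a source; never interprets them."""
    def fetch(self, source: str) -> Content:
        data = source.encode()  # stand-in for real IO (HTTP, filesystem, ...)
        return Content(raw=data, source=source, content_type="text/html")


class Parser:
    """Interprets a Content payload; never performs IO."""
    def parse(self, content: Content) -> str:
        return content.raw.decode()


content = Scraper().fetch("hello")
print(Parser().parse(content))  # the two layers meet only via Content
```

The only coupling point is the `Content` value passed between the two layers, which is what makes each side independently replaceable and testable.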

Installation

Install OmniRead using pip:

pip install omniread

Install OmniRead using Poetry:

poetry add omniread


Quick start

Example

HTML example:

from omniread import HTMLScraper, HTMLParser

scraper = HTMLScraper()
content = scraper.fetch("https://example.com")

class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        # Guard against documents without a <title> element.
        title = self._soup.title
        return title.get_text(strip=True) if title else ""

parser = TitleParser(content)
title = parser.parse()

PDF example:

from omniread import FileSystemPDFClient, PDFScraper, PDFParser
from pathlib import Path

client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))

class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...

parser = TextPDFParser(content)
result = parser.parse()


Public API

This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

  • Content: Canonical content model.
  • ContentType: Supported media types.
  • HTMLScraper: HTTP-based HTML acquisition.
  • HTMLParser: Base parser for HTML DOM interpretation.
  • FileSystemPDFClient: Local filesystem PDF access.
  • PDFScraper: PDF-specific content acquisition.
  • PDFParser: Base parser for PDF binary interpretation.

Core Philosophy

OmniRead is designed as a decoupled content engine:

  1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
  2. Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
  3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.

Classes

Content dataclass

Content(raw: bytes, source: str, content_type: Optional[ContentType] = None, metadata: Optional[Mapping[str, Any]] = None)

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type.
- This class is the primary exchange format between scrapers, parsers, and downstream consumers.
Attributes

content_type: Optional[ContentType] = None

Optional MIME type of the content, if known.

metadata: Optional[Mapping[str, Any]] = None

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

raw: bytes

Raw content bytes as retrieved from the source.

source: str

Identifier of the content origin (URL, file path, or logical name).

ContentType

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the content source.
- It is primarily used for routing content to the appropriate parser or downstream consumer.
Attributes

HTML = 'text/html'

HTML document content.

JSON = 'application/json'

JSON document content.

PDF = 'application/pdf'

PDF document content.

XML = 'application/xml'

XML document content.
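Because the enum derives from both str and Enum, a raw MIME string can be mapped back to a member by value and compared directly against plain strings. A hypothetical mirror enum (illustrative only, not an import from the library) demonstrates the routing idea:

```python
from enum import Enum


class ContentType(str, Enum):
    """Mirror of the documented members; illustrative only."""
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


# A str-backed Enum can be looked up by its MIME value...
declared = ContentType("application/pdf")
print(declared is ContentType.PDF)  # True

# ...which makes dispatching content to the right parser a dict lookup.
parsers = {ContentType.HTML: "HTMLParser", ContentType.PDF: "PDFParser"}
print(parsers[declared])  # PDFParser
```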

FileSystemPDFClient

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns their raw binary contents.
Functions
fetch
fetch(path: Path) -> bytes

Read a PDF file from the local filesystem.

Parameters:

- path (Path): Filesystem path to the PDF file. Required.

Returns:

- bytes: Raw PDF bytes.

Raises:

- FileNotFoundError: If the path does not exist.
- ValueError: If the path exists but is not a file.
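The documented error contract (FileNotFoundError for a missing path, ValueError for a path that is not a regular file) can be reproduced with pathlib. A sketch of equivalent logic, not the library's implementation:

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    """Read raw bytes, enforcing the documented error contract."""
    if not path.exists():
        raise FileNotFoundError(f"No such file: {path}")
    if not path.is_file():
        # The path exists but points at a directory or other non-file entry.
        raise ValueError(f"Not a regular file: {path}")
    return path.read_bytes()
```

Checking existence before file-ness yields the two distinct error types the contract promises, rather than a single catch-all.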

HTMLParser

HTMLParser(content: Content, features: str = 'html.parser')

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- Extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup.
- Provides reusable helpers for common HTML extraction tasks.

Guarantees:

- Accepts only HTML content.
- Owns a parsed BeautifulSoup DOM tree.
- Provides pure helper utilities for common HTML structures.

Constraints:

- Concrete subclasses must define the output type `T` and implement the `parse()` method.

Initialize the HTML parser.

Parameters:

- content (Content): HTML content to be parsed. Required.
- features (str): BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). Default: 'html.parser'.

Raises:

- ValueError: If the content is empty or not valid HTML.

Attributes

supported_types: set[ContentType] = {HTML}

Set of content types supported by this parser (HTML only).

Functions
parse abstractmethod
parse() -> T

Fully parse the HTML content into structured output.

Returns:

- T: Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a deterministic, structured output.
parse_div staticmethod
parse_div(div: Tag, *, separator: str = ' ') -> str

Extract normalized text from a <div> element.

Parameters:

- div (Tag): BeautifulSoup tag representing a <div>. Required.
- separator (str): String used to separate text nodes. Default: ' '.

Returns:

- str: Flattened, whitespace-normalized text content.
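The normalization described here (join text nodes, then collapse runs of whitespace) is a common pattern; a stdlib sketch of the idea, independent of BeautifulSoup:

```python
def normalize_text(fragments: list[str], separator: str = " ") -> str:
    """Join text nodes, then collapse all whitespace runs to single spaces."""
    joined = separator.join(fragments)
    # str.split() with no argument splits on any whitespace run, so
    # newlines, tabs, and repeated spaces all collapse to one space.
    return " ".join(joined.split())


print(normalize_text(["  Hello\n", "  world  "]))  # Hello world
```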

parse_link
parse_link(a: Tag) -> Optional[str]

Extract the hyperlink reference from an <a> element.

Parameters:

- a (Tag): BeautifulSoup tag representing an anchor. Required.

Returns:

- Optional[str]: The value of the href attribute, or None if absent.

parse_meta
parse_meta() -> dict[str, Any]

Extract high-level metadata from the HTML document.

Returns:

- dict[str, Any]: Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document, including the document title and `<meta>` tag name/property to content mappings.
parse_table staticmethod
parse_table(table: Tag) -> list[list[str]]

Parse an HTML table into a 2D list of strings.

Parameters:

- table (Tag): BeautifulSoup tag representing a <table>. Required.

Returns:

- list[list[str]]: A list of rows, where each row is a list of cell text values.
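For illustration, a comparable row/cell extraction can be built on the standard library's html.parser module (note: this is Python's stdlib HTMLParser, unrelated to OmniRead's HTMLParser class); a sketch, assuming well-formed rows where every cell sits inside a <tr>:

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects table rows into a 2D list of cell text values."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell: list[str] | None = None  # open <td>/<th> buffer, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Flatten the cell's text nodes and normalize whitespace.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


extractor = TableExtractor()
extractor.feed("<table><tr><th>a</th><th>b</th></tr>"
               "<tr><td>1</td><td>2</td></tr></table>")
print(extractor.rows)  # [['a', 'b'], ['1', '2']]
```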

supports
supports() -> bool

Check whether this parser supports the content's type.

Returns:

- bool: True if the content type is supported; False otherwise.

HTMLScraper

HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- Retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object.
- Fetches raw bytes and metadata only.
- Uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata.

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses.

Initialize the HTML scraper.

Parameters:

- client (httpx.Client | None): Optional pre-configured httpx.Client. If omitted, a client is created internally. Default: None.
- timeout (float): Request timeout in seconds. Default: 15.0.
- headers (Optional[Mapping[str, str]]): Optional default HTTP headers. Default: None.
- follow_redirects (bool): Whether to follow HTTP redirects. Default: True.
Functions
fetch
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch an HTML document from the given source.

Parameters:

- source (str): URL of the HTML document. Required.
- metadata (Optional[Mapping[str, Any]]): Optional metadata to be merged into the returned content. Default: None.

Returns:

- Content: A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

- HTTPError: If the HTTP request fails.
- ValueError: If the response is not valid HTML.

validate_content_type
validate_content_type(response: httpx.Response) -> None

Validate that the HTTP response contains HTML content.

Parameters:

- response (httpx.Response): HTTP response returned by httpx. Required.

Raises:

- ValueError: If the Content-Type header is missing or does not indicate HTML content.
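The validation described here amounts to inspecting the media type portion of the Content-Type header. A hedged sketch of equivalent logic on a plain header dict (not the library's code; which media types count as HTML is an assumption here):

```python
def is_html_response(headers: dict[str, str]) -> bool:
    """True when the Content-Type header declares an HTML media type."""
    # HTTP header names are case-insensitive; normalize keys before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = lowered.get("content-type", "").split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


print(is_html_response({"Content-Type": "text/html; charset=utf-8"}))  # True
print(is_html_response({"Content-Type": "application/json"}))          # False
```

A missing header yields an empty media type and therefore fails the check, matching the documented ValueError condition for absent Content-Type.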

PDFParser

PDFParser(content: Content)

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- Enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.

Constraints:

- Concrete implementations must define the output type `T` and implement the `parse()` method.

Initialize the parser with content to be parsed.

Parameters:

- content (Content): Content instance to be parsed. Required.

Raises:

- ValueError: If the content type is not supported by this parser.

Attributes

supported_types: set[ContentType] = {PDF}

Set of content types supported by this parser (PDF only).

Functions
parse abstractmethod
parse() -> T

Parse PDF content into a structured output.

Returns:

- T: Parsed representation of type T.

Raises:

- Exception: Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.
supports
supports() -> bool

Check whether this parser supports the content's type.

Returns:

- bool: True if the content type is supported; False otherwise.

PDFScraper

PDFScraper(*, client: BasePDFClient)

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into `Content`.
- Preserves caller-provided metadata.

Constraints:

- Does not perform parsing or interpretation.
- Does not assume a specific storage backend.

Initialize the PDF scraper.

Parameters:

- client (BasePDFClient): PDF client responsible for retrieving raw PDF bytes. Required.
Functions
fetch
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

- source (Any): Identifier of the PDF source as understood by the configured PDF client. Required.
- metadata (Optional[Mapping[str, Any]]): Optional metadata to attach to the returned content. Default: None.

Returns:

- Content: A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

- Exception: Retrieval-specific errors raised by the PDF client.