{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"omniread/","title":"Omniread","text":""},{"location":"omniread/#omniread","title":"omniread","text":"

OmniRead \u2014 format-agnostic content acquisition and parsing framework.

OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

  1. Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.

  2. Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.

  3. Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure: - Clear boundaries between IO and interpretation - Replaceable implementations per format - Predictable, testable behavior
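
The separation described above can be illustrated with a minimal, standard-library-only sketch. The names `SketchContent`, `InMemoryScraper`, and `UpperCaseParser` are hypothetical stand-ins invented for this example and are not OmniRead classes; they only mirror the scraper, Content, parser flow.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class SketchContent:
    # Hypothetical stand-in for the Content container: raw bytes plus minimal context.
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Mapping[str, Any] = field(default_factory=dict)


class InMemoryScraper:
    # Acquires bytes only; it never looks inside them.
    def __init__(self, store: Mapping[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(raw=self._store[source], source=source, content_type='text/plain')


class UpperCaseParser:
    # Interprets bytes only; it never performs IO.
    def __init__(self, content: SketchContent) -> None:
        self._content = content

    def parse(self) -> str:
        return self._content.raw.decode('utf-8').upper()


scraper = InMemoryScraper({'memory://greeting': b'hello'})
content = scraper.fetch('memory://greeting')
result = UpperCaseParser(content).parse()
```

Because the two classes share only the container type, either side can be swapped or tested in isolation.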

"},{"location":"omniread/#omniread--installation","title":"Installation","text":"

Install OmniRead using pip:

pip install omniread\n

Or with Poetry:

poetry add omniread\n
"},{"location":"omniread/#omniread--basic-usage","title":"Basic Usage","text":"

HTML example:

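from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -> str:\n        # <title> may be missing or empty; fall back to an empty string\n        title = self._soup.title\n        if title is None or title.string is None:\n            return \"\"\n        return str(title.string)\n\nparser = TitleParser(content)\ntitle = parser.parse()\n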

PDF example:

from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -> str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n
"},{"location":"omniread/#omniread--public-api-surface","title":"Public API Surface","text":"

This module re-exports the recommended public entry points of OmniRead.

Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

Core: - Content - ContentType

HTML: - HTMLScraper - HTMLParser

PDF: - FileSystemPDFClient - PDFScraper - PDFParser

"},{"location":"omniread/#omniread--core-philosophy","title":"Core Philosophy","text":"

OmniRead is designed as a decoupled content engine:

  1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
  2. Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
  3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.
"},{"location":"omniread/#omniread--documentation-design","title":"Documentation Design","text":"

For those extending OmniRead, follow these \"AI-Native\" docstring principles:

"},{"location":"omniread/#omniread--for-humans","title":"For Humans","text":""},{"location":"omniread/#omniread--for-llms","title":"For LLMs","text":""},{"location":"omniread/#omniread.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.

This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers

Attributes:

Name Type Description raw bytes

Raw content bytes as retrieved from the source.

source str

Identifier of the content origin (URL, file path, or logical name).

content_type Optional[ContentType]

Optional MIME type of the content, if known.

metadata Optional[Mapping[str, Any]]

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"omniread/#omniread.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
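
As a rough sketch of the routing use case, the enum below copies the member values documented in this reference, while the `HANDLERS` table and the `route` function are hypothetical names introduced only for illustration. It assumes that a header value may carry parameters (such as a charset) that need to be dropped before the lookup.

```python
from enum import Enum


class ContentType(str, Enum):
    # Values copied from the enum members documented in this reference.
    HTML = 'text/html'
    JSON = 'application/json'
    PDF = 'application/pdf'
    XML = 'application/xml'


# Hypothetical routing table: the handler names are placeholders.
HANDLERS = {
    ContentType.HTML: 'html-parser',
    ContentType.PDF: 'pdf-parser',
}


def route(mime: str) -> str:
    # Drop parameters such as '; charset=utf-8' before the enum lookup.
    base = mime.split(';', 1)[0].strip().lower()
    try:
        content_type = ContentType(base)
    except ValueError:
        return 'unsupported'
    return HANDLERS.get(content_type, 'unsupported')
```

Since the enum subclasses `str`, members compare equal to their MIME strings, which keeps lookups and serialization simple.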

"},{"location":"omniread/#omniread.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"omniread/#omniread.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"omniread/#omniread.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"omniread/#omniread.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"omniread/#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

This client reads PDF files directly from the disk and returns their raw binary contents.

"},{"location":"omniread/#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Type Description bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.
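
The documented error contract could be reproduced along these lines. This is a sketch under assumptions, not the client's actual source: `read_pdf_bytes` is a hypothetical helper name, and the error messages are invented.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    # Hypothetical helper mirroring the documented error contract.
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    if not path.is_file():
        raise ValueError(f'Not a regular file: {path}')
    # Return the bytes untouched; interpretation belongs to a parser.
    return path.read_bytes()
```

Checking existence before the file check keeps the two failure modes distinct for callers that want to handle them differently.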

"},{"location":"omniread/#omniread.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.

Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.

Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures

Concrete subclasses must: - Define the output type T - Implement the parse() method

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"omniread/#omniread.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"omniread/#omniread.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Implementations must fully interpret the HTML DOM and return a deterministic, structured output.

Returns:

Type Description T

Parsed representation of type T.

"},{"location":"omniread/#omniread.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Type Description str

Flattened, whitespace-normalized text content.

"},{"location":"omniread/#omniread.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

The value of the href attribute, or None if absent.

"},{"location":"omniread/#omniread.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

This includes: - Document title - <meta> tag name/property \u2192 content mappings

Returns:

Type Description dict[str, Any]

Dictionary containing extracted metadata.

"},{"location":"omniread/#omniread.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

A list of rows, where each row is a list of cell text values.
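
To make the output shape concrete, here is a sketch that produces the same rows-of-cells structure using the standard library's `html.parser` instead of BeautifulSoup. `TableCollector` and `table_to_rows` are hypothetical names, the input is an HTML string rather than a Tag, and nested tables are not handled.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import List, Optional


class TableCollector(StdlibHTMLParser):
    # Hypothetical helper: gathers the text of each td/th cell, row by row.
    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th') and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Collapse internal whitespace so each cell is a clean string.
            self.rows[-1].append(' '.join(''.join(self._cell).split()))
            self._cell = None


def table_to_rows(html: str) -> List[List[str]]:
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.rows
```

Header cells and data cells are treated alike here, so the first row of a typical table carries the column names.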

"},{"location":"omniread/#omniread.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Type Description bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/#omniread.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.

Fetches raw bytes and metadata only. The scraper: - Uses httpx.Client for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata

The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Optional[Client]

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"omniread/#omniread.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Type Description Content

A Content instance wrapping the raw HTML bytes and the associated metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

"},{"location":"omniread/#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.
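
A check of this kind might look like the sketch below. It operates on a plain header mapping rather than an `httpx.Response`, `ensure_html` is a hypothetical name, and treating only `text/html` as HTML is an assumption made for this example.

```python
from typing import Mapping


def ensure_html(headers: Mapping[str, str]) -> None:
    # Hypothetical check over a plain header mapping, not an httpx.Response.
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get('content-type')
    if not content_type:
        raise ValueError('Missing Content-Type header')
    # Ignore parameters such as '; charset=utf-8' when comparing.
    media_type = content_type.split(';', 1)[0].strip().lower()
    if media_type != 'text/html':
        raise ValueError(f'Expected text/html, got {media_type}')
```

Validating before the body is wrapped means a non-HTML response never reaches an HTML parser.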

"},{"location":"omniread/#omniread.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.

Concrete implementations must: - Define the output type T - Implement the parse() method

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/#omniread.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"omniread/#omniread.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.

Returns:

Type Description T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

"},{"location":"omniread/#omniread.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Type Description bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/#omniread.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Delegates byte retrieval to a PDF client and normalizes output into Content.

The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"omniread/#omniread.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Type Description Content

A Content instance wrapping the raw PDF bytes and any caller-provided metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.
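
The delegation pattern described for this scraper can be sketched with standard-library types only. `BytesClient`, `DictClient`, `DelegatingScraper`, and `SketchContent` are hypothetical names for this illustration; the real classes may differ in fields and behavior.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Protocol


class BytesClient(Protocol):
    # Hypothetical client contract: anything with fetch(source) -> bytes.
    def fetch(self, source: Any) -> bytes: ...


@dataclass(frozen=True)
class SketchContent:
    raw: bytes
    source: str
    content_type: str
    metadata: Optional[Mapping[str, Any]] = None


class DictClient:
    # Stand-in storage backend backed by a dictionary.
    def __init__(self, blobs: Mapping[str, bytes]) -> None:
        self._blobs = blobs

    def fetch(self, source: Any) -> bytes:
        return self._blobs[source]


class DelegatingScraper:
    def __init__(self, *, client: BytesClient) -> None:
        self._client = client

    def fetch(self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> SketchContent:
        raw = self._client.fetch(source)  # retrieval is delegated, never parsed here
        return SketchContent(
            raw=raw,
            source=str(source),
            content_type='application/pdf',
            metadata=dict(metadata) if metadata is not None else None,
        )
```

Swapping `DictClient` for a filesystem or object-storage client changes where the bytes come from without touching the scraper.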

"},{"location":"omniread/core/","title":"Core","text":""},{"location":"omniread/core/#omniread.core","title":"omniread.core","text":"

Core domain contracts for OmniRead.

This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).

Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.

Submodules: - content: Canonical content models and enums - parser: Abstract parsing contracts - scraper: Abstract scraping contracts

Format-specific behavior must not be introduced at this layer.

"},{"location":"omniread/core/#omniread.core.BaseParser","title":"BaseParser","text":"
BaseParser(content: Content)\n

Bases: ABC, Generic[T]

Base interface for all parsers.

A parser is a self-contained object that owns the Content it is responsible for interpreting.

Implementations must: - Declare supported content types via supported_types - Raise parsing-specific exceptions from parse() - Remain deterministic for a given input

Consumers may rely on: - Early validation of content compatibility - Type-stable return values from parse()

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/core/#omniread.core.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: Set[ContentType] = set()\n

Set of content types supported by this parser.

An empty set indicates that the parser is content-type agnostic.

"},{"location":"omniread/core/#omniread.core.BaseParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse the owned content into structured output.

Implementations must fully consume the provided content and return a deterministic, structured output.

Returns:

Type Description T

Parsed, structured representation.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

"},{"location":"omniread/core/#omniread.core.BaseParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Type Description bool

True if the content type is supported; False otherwise.
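
The interplay between `supported_types`, `supports()`, and early validation in the constructor can be sketched as follows. `SketchParser` and its subclasses are hypothetical stand-ins, content types are plain strings here, and rejecting an unknown (None) content type when the set is non-empty is an assumption of this sketch.

```python
from abc import ABC, abstractmethod
from typing import Generic, Optional, Set, TypeVar

T = TypeVar('T')


class SketchParser(ABC, Generic[T]):
    # Hypothetical stand-in for a parser base class.
    supported_types: Set[str] = set()

    def __init__(self, raw: bytes, content_type: Optional[str]) -> None:
        self.raw = raw
        self.content_type = content_type
        # Fail early, before any parsing work is attempted.
        if not self.supports():
            raise ValueError(f'Unsupported content type: {content_type}')

    def supports(self) -> bool:
        # An empty set means the parser accepts any content type.
        if not self.supported_types:
            return True
        return self.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class LengthParser(SketchParser[int]):
    # Agnostic parser: inherits the empty supported_types set.
    def parse(self) -> int:
        return len(self.raw)


class HtmlOnlyParser(SketchParser[str]):
    supported_types = {'text/html'}

    def parse(self) -> str:
        return self.raw.decode('utf-8')
```

Parameterizing the base class with the output type lets a type checker verify what each concrete `parse()` returns.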

"},{"location":"omniread/core/#omniread.core.BaseScraper","title":"BaseScraper","text":"

Bases: ABC

Base interface for all scrapers.

A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it.

A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a Content object.

Scrapers define how content is obtained, not what the content means.

Implementations may vary in: - Transport mechanism (HTTP, filesystem, cloud storage) - Authentication strategy - Retry and backoff behavior

Implementations must not: - Parse content - Modify content semantics - Couple scraping logic to a specific parser

"},{"location":"omniread/core/#omniread.core.BaseScraper.fetch","title":"fetch abstractmethod","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch raw content from the given source.

Implementations must retrieve the content referenced by source and return it as raw bytes wrapped in a Content object.

Parameters:

Name Type Description Default source str

Location identifier (URL, file path, S3 URI, etc.)

required metadata Optional[Mapping[str, Any]]

Optional hints for the scraper (headers, auth, etc.)

None

Returns:

Type Description Content

Content object containing raw bytes and metadata.


Raises:

Type Description Exception

Retrieval-specific errors as defined by the implementation.

"},{"location":"omniread/core/#omniread.core.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.

This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers

Attributes:

Name Type Description raw bytes

Raw content bytes as retrieved from the source.

source str

Identifier of the content origin (URL, file path, or logical name).

content_type Optional[ContentType]

Optional MIME type of the content, if known.

metadata Optional[Mapping[str, Any]]

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"omniread/core/#omniread.core.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.

"},{"location":"omniread/core/#omniread.core.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"omniread/core/#omniread.core.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"omniread/core/#omniread.core.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"omniread/core/#omniread.core.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"omniread/core/content/","title":"Content","text":""},{"location":"omniread/core/content/#omniread.core.content","title":"omniread.core.content","text":"

Canonical content models for OmniRead.

This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.

The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.

"},{"location":"omniread/core/content/#omniread.core.content.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.

This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers

Attributes:

Name Type Description raw bytes

Raw content bytes as retrieved from the source.

source str

Identifier of the content origin (URL, file path, or logical name).

content_type Optional[ContentType]

Optional MIME type of the content, if known.

metadata Optional[Mapping[str, Any]]

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"omniread/core/content/#omniread.core.content.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.

"},{"location":"omniread/core/content/#omniread.core.content.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"omniread/core/content/#omniread.core.content.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"omniread/core/content/#omniread.core.content.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"omniread/core/content/#omniread.core.content.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"omniread/core/parser/","title":"Parser","text":""},{"location":"omniread/core/parser/#omniread.core.parser","title":"omniread.core.parser","text":"

Abstract parsing contracts for OmniRead.

This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.

Parsers are responsible for: - Interpreting a single Content instance - Validating compatibility with the content type - Producing a structured output suitable for downstream consumers

Parsers are not responsible for: - Fetching or acquiring content - Performing retries or error recovery - Managing multiple content sources

"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"
BaseParser(content: Content)\n

Bases: ABC, Generic[T]

Base interface for all parsers.

A parser is a self-contained object that owns the Content it is responsible for interpreting.

Implementations must: - Declare supported content types via supported_types - Raise parsing-specific exceptions from parse() - Remain deterministic for a given input

Consumers may rely on: - Early validation of content compatibility - Type-stable return values from parse()

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: Set[ContentType] = set()\n

Set of content types supported by this parser.

An empty set indicates that the parser is content-type agnostic.

"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse the owned content into structured output.

Implementations must fully consume the provided content and return a deterministic, structured output.

Returns:

Type Description T

Parsed, structured representation.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Type Description bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/core/scraper/","title":"Scraper","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":"

Abstract scraping contracts for OmniRead.

This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.

Scrapers are responsible for: - Locating and retrieving raw content bytes - Attaching minimal contextual metadata - Returning normalized Content objects

Scrapers are explicitly NOT responsible for: - Parsing or interpreting content - Inferring structure or semantics - Performing content-type specific processing

All interpretation must be delegated to parsers.

"},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":"

Bases: ABC

Base interface for all scrapers.

A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it.

A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a Content object.

Scrapers define how content is obtained, not what the content means.

Implementations may vary in: - Transport mechanism (HTTP, filesystem, cloud storage) - Authentication strategy - Retry and backoff behavior

Implementations must not: - Parse content - Modify content semantics - Couple scraping logic to a specific parser

"},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch abstractmethod","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch raw content from the given source.

Implementations must retrieve the content referenced by source and return it as raw bytes wrapped in a Content object.

Parameters:

Name Type Description Default source str

Location identifier (URL, file path, S3 URI, etc.)

required metadata Optional[Mapping[str, Any]]

Optional hints for the scraper (headers, auth, etc.)

None

Returns:

Type Description Content

Content object containing raw bytes and metadata.


Raises:

Type Description Exception

Retrieval-specific errors as defined by the implementation.

"},{"location":"omniread/html/","title":"Html","text":""},{"location":"omniread/html/#omniread.html","title":"omniread.html","text":"

HTML format implementation for OmniRead.

This package provides HTML-specific implementations of the core OmniRead contracts defined in omniread.core.

It includes: - HTML parsers that interpret HTML content - HTML scrapers that retrieve HTML documents

This package: - Implements, but does not redefine, core contracts - May contain HTML-specific behavior and edge-case handling - Produces canonical content models defined in omniread.core.content

Consumers should depend on omniread.core interfaces wherever possible and use this package only when HTML-specific behavior is required.

"},{"location":"omniread/html/#omniread.html.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.

Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.

Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures

Concrete subclasses must: - Define the output type T - Implement the parse() method

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"omniread/html/#omniread.html.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"omniread/html/#omniread.html.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Implementations must fully interpret the HTML DOM and return a deterministic, structured output.

Returns:

Type Description T

Parsed representation of type T.

"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Type Description str

Flattened, whitespace-normalized text content.

"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

The value of the href attribute, or None if absent.

"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

This includes: - Document title - <meta> tag name/property \u2192 content mappings

Returns:

Type Description dict[str, Any]

Dictionary containing extracted metadata.

"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

A list of rows, where each row is a list of cell text values.

"},{"location":"omniread/html/#omniread.html.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Type Description bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.

Fetches raw bytes and metadata only. The scraper: - Uses httpx.Client for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata

The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Optional[Client]

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"omniread/html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Type Description Content

A Content instance containing the raw HTML bytes and the associated metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

"},{"location":"omniread/html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.
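
A rough sketch of the validation rule stated above, working on the header value alone so that it runs without httpx. The helper name ensure_html and the set of accepted media types are assumptions made for illustration; the library's actual check may differ.

```python
from typing import Optional

# Assumed set of media types treated as HTML in this sketch.
HTML_MEDIA_TYPES = {'text/html', 'application/xhtml+xml'}


def ensure_html(content_type: Optional[str]) -> None:
    # Hypothetical helper; raises ValueError like the documented method.
    if not content_type:
        raise ValueError('Missing Content-Type header')
    # Drop parameters such as '; charset=utf-8' before comparing.
    media_type = content_type.split(';', 1)[0].strip().lower()
    if media_type not in HTML_MEDIA_TYPES:
        raise ValueError('Unsupported Content-Type: ' + media_type)


ensure_html('text/html; charset=utf-8')  # passes silently
```

With a real response, the value passed in would typically come from response.headers.get('content-type').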

"},{"location":"omniread/html/parser/","title":"Parser","text":""},{"location":"omniread/html/parser/#omniread.html.parser","title":"omniread.html.parser","text":"

HTML parser base implementations for OmniRead.

This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.

It supplies: - Content-type enforcement for HTML inputs - BeautifulSoup initialization and lifecycle management - Common helper methods for extracting structured data from HTML elements

Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.

Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.

Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures

Concrete subclasses must: - Define the output type T - Implement the parse() method

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Implementations must fully interpret the HTML DOM and return a deterministic, structured output.

Returns:

Type Description T

Parsed representation of type T.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Type Description str

Flattened, whitespace-normalized text content.
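
To make the meaning of flattened, whitespace-normalized text concrete, here is a small sketch that operates on plain strings instead of a BeautifulSoup Tag. The function name normalize_text is hypothetical, and the exact normalization applied by parse_div is an assumption.

```python
from typing import Iterable


def normalize_text(fragments: Iterable[str], separator: str = ' ') -> str:
    # Collapse runs of whitespace inside each fragment.
    cleaned = (' '.join(fragment.split()) for fragment in fragments)
    # Drop fragments that end up empty, then join with the separator.
    return separator.join(piece for piece in cleaned if piece)


print(normalize_text(['  Hello ', '', 'wide   world  ']))
# Hello wide world
print(normalize_text(['  Hello ', '', 'wide   world  '], separator=' | '))
# Hello | wide world
```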

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

The value of the href attribute, or None if absent.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

This includes: - Document title - <meta> tag name/property \u2192 content mappings

Returns:

Type Description dict[str, Any]

Dictionary containing extracted metadata.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

A list of rows, where each row is a list of cell text values.
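
The row and cell structure of the return value can be illustrated with a standard-library sketch. It parses markup directly rather than a BeautifulSoup Tag, so it is a stand-in for the idea and not the library's implementation; TableCollector and table_to_rows are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class TableCollector(StdlibHTMLParser):
    # Collects text from <td>/<th> cells, grouped by <tr> rows.
    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell = None  # list of text pieces while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th') and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Normalize whitespace in the finished cell and store it.
            self.rows[-1].append(' '.join(''.join(self._cell).split()))
            self._cell = None


def table_to_rows(markup: str) -> list[list[str]]:
    collector = TableCollector()
    collector.feed(markup)
    collector.close()
    return collector.rows


markup = (
    '<table><tr><th>Name</th><th>Qty</th></tr>'
    '<tr><td>Apple</td><td> 3 </td></tr></table>'
)
print(table_to_rows(markup))
# [['Name', 'Qty'], ['Apple', '3']]
```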

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Type Description bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/html/scraper/","title":"Scraper","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":"

HTML scraping implementation for OmniRead.

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for: - Fetching raw HTML bytes over HTTP(S) - Validating response content type - Attaching HTTP metadata to the returned content

This scraper is not responsible for: - Parsing or interpreting HTML - Retrying failed requests - Managing crawl policies or rate limiting

"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.

Fetches raw bytes and metadata only. The scraper: - Uses httpx.Client for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata

The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Optional[Client]

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Type Description Content

A Content instance containing the raw HTML bytes and the associated metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.

"},{"location":"omniread/pdf/","title":"Pdf","text":""},{"location":"omniread/pdf/#omniread.pdf","title":"omniread.pdf","text":"

PDF format implementation for OmniRead.

This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.

Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes: - PDF clients for acquiring raw PDF data - PDF scrapers that coordinate client access - PDF parsers that extract structured content from PDF binaries

Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.

"},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

This client reads PDF files directly from disk and returns their raw binary contents.

"},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Type Description bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.
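
The error contract above can be sketched with pathlib alone. The helper name read_pdf_bytes is hypothetical; the sketch only mirrors the documented behavior and is not taken from the library's source.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    # Hypothetical helper mirroring the documented error contract.
    if not path.exists():
        raise FileNotFoundError('PDF not found: ' + str(path))
    if not path.is_file():
        # For example, the path points at a directory.
        raise ValueError('Path is not a file: ' + str(path))
    # Return the file contents unchanged; no PDF validation is performed.
    return path.read_bytes()
```

A caller would typically catch FileNotFoundError and ValueError separately, since the first signals a missing path and the second a path of the wrong kind.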

"},{"location":"omniread/pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.

Concrete implementations must: - Define the output type T - Implement the parse() method

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.

Returns:

Type Description T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Type Description bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Delegates byte retrieval to a PDF client and normalizes output into Content.

The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Type Description Content

A Content instance containing the raw PDF bytes and any caller-provided metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.

"},{"location":"omniread/pdf/client/","title":"Client","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":"

PDF client abstractions for OmniRead.

This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.

Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.

Typical backing stores include: - Local filesystems - Object storage (S3, GCS, etc.) - Network file systems

"},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":"

Bases: ABC

Abstract client responsible for retrieving PDF bytes from a specific backing store (filesystem, S3, FTP, etc.).

Implementations must: - Accept a source identifier appropriate to the backing store - Return the full PDF binary payload - Raise retrieval-specific errors on failure
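
As an illustration of the three requirements above, here is a sketch of a custom client backed by an in-memory dictionary. The BasePDFClient defined in the block is a local stand-in so the example runs on its own, and InMemoryPDFClient and its use of LookupError are assumptions, not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Any


class BasePDFClient(ABC):
    # Local stand-in for omniread's BasePDFClient so the sketch runs alone.
    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        ...


class InMemoryPDFClient(BasePDFClient):
    # Hypothetical client backed by a dict mapping keys to PDF bytes.
    def __init__(self, documents: dict[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(self, source: Any) -> bytes:
        try:
            # Return the full binary payload for the given key.
            return self._documents[source]
        except KeyError:
            # Retrieval-specific error chosen for this sketch.
            raise LookupError('No PDF stored under key: ' + repr(source)) from None


client = InMemoryPDFClient({'report': b'%PDF-1.7 sample'})
payload = client.fetch('report')
```

In real use the class would subclass the BasePDFClient exported by omniread.pdf.client and could be passed to PDFScraper(client=...).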

"},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch abstractmethod","text":"
fetch(source: Any) -> bytes\n

Fetch raw PDF bytes from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF location, such as a file path, object storage key, or remote reference.

required

Returns:

Type Description bytes

Raw PDF bytes.

Raises:

Type Description Exception

Retrieval-specific errors defined by the implementation.

"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

This client reads PDF files directly from disk and returns their raw binary contents.

"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Type Description bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.

"},{"location":"omniread/pdf/parser/","title":"Parser","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":"

PDF parser base implementations for OmniRead.

This module defines the PDF-specific parser contract, extending the format-agnostic BaseParser with constraints appropriate for PDF content.

PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.

"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.

Concrete implementations must: - Define the output type T - Implement the parse() method

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.

Returns:

Type Description T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Type Description bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/pdf/scraper/","title":"Scraper","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":"

PDF scraping implementation for OmniRead.

This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.

The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.

"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Delegates byte retrieval to a PDF client and normalizes output into Content.

The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Type Description Content

A Content instance containing the raw PDF bytes and any caller-provided metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.
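
The delegation described for fetch can be sketched as follows. SketchContent and SketchPDFScraper are stand-ins invented for this example, and the field names raw, source, and metadata are assumptions; the real Content class and PDFScraper may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Mapping, Optional


@dataclass
class SketchContent:
    # Stand-in for omniread's Content; the real field names may differ.
    raw: bytes
    source: str
    metadata: dict[str, Any] = field(default_factory=dict)


class SketchPDFScraper:
    def __init__(self, *, fetch_bytes: Callable[[Any], bytes]) -> None:
        # fetch_bytes plays the role of client.fetch in this sketch.
        self._fetch_bytes = fetch_bytes

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> SketchContent:
        # Errors raised by the client propagate to the caller unchanged.
        raw = self._fetch_bytes(source)
        # Copy caller-provided metadata so later mutation does not leak in.
        return SketchContent(raw=raw, source=str(source), metadata=dict(metadata or {}))


scraper = SketchPDFScraper(fetch_bytes=lambda key: b'%PDF-1.7 ' + str(key).encode())
content = scraper.fetch('invoice-42', metadata={'origin': 'example'})
print(content.metadata)
```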

"}]}