{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"omniread","text":""},{"location":"#omniread","title":"omniread","text":"

OmniRead \u2014 format-agnostic content acquisition and parsing framework.

"},{"location":"#omniread--summary","title":"Summary","text":"

OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

  1. Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
  2. Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
  3. Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure:

  - Clear boundaries between IO and interpretation
  - Replaceable implementations per format
  - Predictable, testable behavior
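
The layering described above can be illustrated with a minimal, self-contained sketch. It does not import OmniRead: `Content`, `InMemoryScraper`, and `LengthParser` are hypothetical stand-ins written for this example, and the `memory://greeting` source name is invented. The point is only that the scraper returns bytes without interpreting them and the parser works solely from the container it receives.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in for the Content container: raw bytes plus minimal metadata.
@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[str] = None


# Stand-in scraper: acquires bytes and never interprets them.
class InMemoryScraper:
    def __init__(self, pages: dict):
        self._pages = pages

    def fetch(self, source: str) -> Content:
        return Content(raw=self._pages[source], source=source, content_type='text/html')


# Stand-in parser: interprets Content and returns a typed result.
class LengthParser:
    def __init__(self, content: Content):
        self._content = content

    def parse(self) -> int:
        return len(self._content.raw.decode('utf-8'))


scraper = InMemoryScraper({'memory://greeting': b'<p>hello</p>'})
content = scraper.fetch('memory://greeting')
length = LengthParser(content).parse()
```

Because the two classes share only the container, either side could be swapped (for example, a filesystem-backed scraper) without touching the other.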

"},{"location":"#omniread--installation","title":"Installation","text":"

Install OmniRead using pip:

pip install omniread\n

Or with Poetry:

poetry add omniread\n
"},{"location":"#omniread--quick-start","title":"Quick start","text":"

HTML example:

from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -> str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n

PDF example:

from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -> str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n
"},{"location":"#omniread--public-api","title":"Public API","text":"

This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

Core:

  - Content
  - ContentType

HTML:

  - HTMLScraper
  - HTMLParser

PDF:

  - FileSystemPDFClient
  - PDFScraper
  - PDFParser

Core Philosophy: OmniRead is designed as a decoupled content engine:

  1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
  2. Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
  3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.

"},{"location":"#omniread-classes","title":"Classes","text":""},{"location":"#omniread.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between scrapers, parsers, and downstream consumers\n
"},{"location":"#omniread.Content-attributes","title":"Attributes","text":""},{"location":"#omniread.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"#omniread.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"#omniread.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"#omniread.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"#omniread.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n
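
As a rough illustration of the routing use mentioned above, the sketch below redefines an enum with the documented members and looks up a handler by MIME type. The `HANDLERS` table, the handler names, and the `route` function are hypothetical and not part of OmniRead; stripping parameters such as a charset before the lookup is also an assumption of this sketch.

```python
from enum import Enum


# Mirrors the documented members; values are the MIME strings.
class ContentType(str, Enum):
    HTML = 'text/html'
    JSON = 'application/json'
    PDF = 'application/pdf'
    XML = 'application/xml'


# Hypothetical routing table from content type to a handler name.
HANDLERS = {
    ContentType.HTML: 'html-parser',
    ContentType.PDF: 'pdf-parser',
}


def route(declared: str) -> str:
    # Drop parameters such as a charset, then look the value up in the enum.
    mime = declared.split(';', 1)[0].strip().lower()
    try:
        content_type = ContentType(mime)
    except ValueError:
        return 'unsupported'
    return HANDLERS.get(content_type, 'unsupported')
```

Since the enum also subclasses `str`, a member compares equal to its MIME string, which keeps comparisons against header values straightforward.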
"},{"location":"#omniread.ContentType-attributes","title":"Attributes","text":""},{"location":"#omniread.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"#omniread.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"#omniread.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"#omniread.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns their raw binary contents\n
"},{"location":"#omniread.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.
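
The documented error behavior could be reproduced along the following lines. This is a sketch under assumptions, not the library implementation: `read_pdf_bytes` is a hypothetical helper name, and the error messages are invented for illustration.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    # Hypothetical helper mirroring the documented error behavior.
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    if not path.is_file():
        raise ValueError(f'Path is not a file: {path}')
    # Return the file contents unmodified; no PDF validation is attempted.
    return path.read_bytes()
```

Checking existence before the file-type test keeps the two failure modes distinct, so a caller can tell a missing path from a directory passed by mistake.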

"},{"location":"#omniread.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Concrete parsers must explicitly define the return type\n

Guarantees:

- Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, and provides pure helper utilities for common HTML structures\n

Constraints:

- Concrete subclasses must define the output type `T` and implement the `parse()` method\n

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"#omniread.HTMLParser-attributes","title":"Attributes","text":""},{"location":"#omniread.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"#omniread.HTMLParser-functions","title":"Functions","text":""},{"location":"#omniread.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n
"},{"location":"#omniread.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Name Type Description str str

Flattened, whitespace-normalized text content.

"},{"location":"#omniread.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

The value of the href attribute, or None if absent.

"},{"location":"#omniread.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

Type Description dict[str, Any]

Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document\n- This includes the document title and `<meta>` tag name/property \u2192 content mappings\n
"},{"location":"#omniread.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

A list of rows, where each row is a list of cell text values.

"},{"location":"#omniread.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"#omniread.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses\n

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Client | None

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"#omniread.HTMLScraper-functions","title":"Functions","text":""},{"location":"#omniread.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

"},{"location":"#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.
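
A check of this kind might look like the sketch below. It is not the library code: `validate_html_content_type` is a hypothetical stand-alone function that takes a plain header mapping instead of an `httpx.Response`, assumes lower-case header keys, and accepts only `text/html`. Whether OmniRead accepts related types such as XHTML is not stated here.

```python
from typing import Mapping


def validate_html_content_type(headers: Mapping[str, str]) -> None:
    # Hypothetical stand-alone check; takes a plain header mapping.
    content_type = headers.get('content-type')
    if content_type is None:
        raise ValueError('Missing Content-Type header')
    # Ignore parameters such as the charset and compare the bare MIME type.
    mime = content_type.split(';', 1)[0].strip().lower()
    if mime != 'text/html':
        raise ValueError(f'Expected text/html, got {mime}')
```

Calling it with `{'content-type': 'text/html; charset=utf-8'}` returns without error, while a PDF or missing header raises `ValueError`.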

"},{"location":"#omniread.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n

Constraints:

- Concrete implementations must define the output type `T` and implement the `parse()` method\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"#omniread.PDFParser-attributes","title":"Attributes","text":""},{"location":"#omniread.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"#omniread.PDFParser-functions","title":"Functions","text":""},{"location":"#omniread.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n
"},{"location":"#omniread.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"#omniread.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n

Constraints:

- The scraper does not perform parsing or interpretation and does not assume a specific storage backend\n

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"#omniread.PDFScraper-functions","title":"Functions","text":""},{"location":"#omniread.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.

"},{"location":"core/","title":"Core","text":""},{"location":"core/#omniread.core","title":"omniread.core","text":"

Core domain contracts for OmniRead.

"},{"location":"core/#omniread.core--summary","title":"Summary","text":"

This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).

Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.

Submodules:

  - content: Canonical content models and enums
  - parser: Abstract parsing contracts
  - scraper: Abstract scraping contracts

Format-specific behavior must not be introduced at this layer.

"},{"location":"core/#omniread.core--public-api","title":"Public API","text":"
Content\nContentType\n
"},{"location":"core/#omniread.core-classes","title":"Classes","text":""},{"location":"core/#omniread.core.BaseParser","title":"BaseParser","text":"
BaseParser(content: Content)\n

Bases: ABC, Generic[T]

Base interface for all parsers.

Notes

Guarantees:

- A parser is a self-contained object that owns the Content it is responsible for interpreting\n- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n

Responsibilities:

- Implementations must declare supported content types via `supported_types`\n- Implementations must raise parsing-specific exceptions from `parse()`\n- Implementations must remain deterministic for a given input\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"core/#omniread.core.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: Set[ContentType] = set()\n

Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.

"},{"location":"core/#omniread.core.BaseParser-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse the owned content into structured output.

Returns:

Name Type Description T T

Parsed, structured representation.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully consume the provided content and return a deterministic, structured output\n
"},{"location":"core/#omniread.core.BaseParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.
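
The interplay between `supported_types`, `supports()`, and early validation can be sketched as follows. This is an illustrative reconstruction, not the library source: `SketchParser` and `WordCountParser` are hypothetical names, the constructor takes a content type and bytes directly instead of a `Content` object, and rejecting an unknown (None) content type when the set is non-empty is an assumption of this sketch.

```python
from abc import ABC, abstractmethod
from typing import Generic, Optional, Set, TypeVar

T = TypeVar('T')


class SketchParser(ABC, Generic[T]):
    # Empty set means the parser accepts any content type.
    supported_types: Set[str] = set()

    def __init__(self, content_type: Optional[str], raw: bytes):
        self.content_type = content_type
        self.raw = raw
        # Early validation: reject unsupported content at construction time.
        if not self.supports():
            raise ValueError(f'Unsupported content type: {content_type}')

    def supports(self) -> bool:
        if not self.supported_types:
            return True
        return self.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class WordCountParser(SketchParser[int]):
    supported_types = {'text/html'}

    def parse(self) -> int:
        # Count whitespace-separated tokens in the raw bytes.
        return len(self.raw.split())
```

Validating in the constructor means a mismatched parser fails before `parse()` is ever called, which matches the early-validation guarantee described for `BaseParser`.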

"},{"location":"core/#omniread.core.BaseScraper","title":"BaseScraper","text":"

Bases: ABC

Base interface for all scrapers.

Notes

Responsibilities:

- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n- Scrapers define how content is obtained, not what the content means\n- Implementations may vary in transport mechanism, authentication strategy, and retry and backoff behavior\n

Constraints:

- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser\n
"},{"location":"core/#omniread.core.BaseScraper-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseScraper.fetch","title":"fetch abstractmethod","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch raw content from the given source.

Parameters:

Name Type Description Default source str

Location identifier (URL, file path, S3 URI, etc.)

required metadata Optional[Mapping[str, Any]]

Optional hints for the scraper (headers, auth, etc.)

None

Returns:

Name Type Description Content Content

Content object containing raw bytes and metadata.

Raises:

Type Description Exception

Retrieval-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object\n
"},{"location":"core/#omniread.core.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between scrapers, parsers, and downstream consumers\n
"},{"location":"core/#omniread.core.Content-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"core/#omniread.core.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"core/#omniread.core.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"core/#omniread.core.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"core/#omniread.core.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n
"},{"location":"core/#omniread.core.ContentType-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"core/#omniread.core.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"core/#omniread.core.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"core/#omniread.core.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"core/content/","title":"Content","text":""},{"location":"core/content/#omniread.core.content","title":"omniread.core.content","text":"

Canonical content models for OmniRead.

"},{"location":"core/content/#omniread.core.content--summary","title":"Summary","text":"

This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.

The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.

"},{"location":"core/content/#omniread.core.content-classes","title":"Classes","text":""},{"location":"core/content/#omniread.core.content.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between scrapers, parsers, and downstream consumers\n
"},{"location":"core/content/#omniread.core.content.Content-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"core/content/#omniread.core.content.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"core/content/#omniread.core.content.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"core/content/#omniread.core.content.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"core/content/#omniread.core.content.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n
"},{"location":"core/content/#omniread.core.content.ContentType-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"core/content/#omniread.core.content.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"core/content/#omniread.core.content.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"core/content/#omniread.core.content.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"core/parser/","title":"Parser","text":""},{"location":"core/parser/#omniread.core.parser","title":"omniread.core.parser","text":"

Abstract parsing contracts for OmniRead.

"},{"location":"core/parser/#omniread.core.parser--summary","title":"Summary","text":"

This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.

Parsers are responsible for:

  - Interpreting a single Content instance
  - Validating compatibility with the content type
  - Producing a structured output suitable for downstream consumers

Parsers are not responsible for:

  - Fetching or acquiring content
  - Performing retries or error recovery
  - Managing multiple content sources

"},{"location":"core/parser/#omniread.core.parser-classes","title":"Classes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"
BaseParser(content: Content)\n

Bases: ABC, Generic[T]

Base interface for all parsers.

Notes

Guarantees:

- A parser is a self-contained object that owns the Content it is responsible for interpreting\n- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n

Responsibilities:

- Implementations must declare supported content types via `supported_types`\n- Implementations must raise parsing-specific exceptions from `parse()`\n- Implementations must remain deterministic for a given input\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"core/parser/#omniread.core.parser.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: Set[ContentType] = set()\n

Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.

"},{"location":"core/parser/#omniread.core.parser.BaseParser-functions","title":"Functions","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse the owned content into structured output.

Returns:

Name Type Description T T

Parsed, structured representation.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully consume the provided content and return a deterministic, structured output\n
"},{"location":"core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"core/scraper/","title":"Scraper","text":""},{"location":"core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":"

Abstract scraping contracts for OmniRead.

"},{"location":"core/scraper/#omniread.core.scraper--summary","title":"Summary","text":"

This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.

Scrapers are responsible for:

  - Locating and retrieving raw content bytes
  - Attaching minimal contextual metadata
  - Returning normalized Content objects

Scrapers are explicitly NOT responsible for:

  - Parsing or interpreting content
  - Inferring structure or semantics
  - Performing content-type specific processing

All interpretation must be delegated to parsers.

"},{"location":"core/scraper/#omniread.core.scraper-classes","title":"Classes","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":"

Bases: ABC

Base interface for all scrapers.

Notes

Responsibilities:

- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n- Scrapers define how content is obtained, not what the content means\n- Implementations may vary in transport mechanism, authentication strategy, and retry and backoff behavior\n

Constraints:

- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser\n
"},{"location":"core/scraper/#omniread.core.scraper.BaseScraper-functions","title":"Functions","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch abstractmethod","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch raw content from the given source.

Parameters:

Name Type Description Default source str

Location identifier (URL, file path, S3 URI, etc.)

required metadata Optional[Mapping[str, Any]]

Optional hints for the scraper (headers, auth, etc.)

None

Returns:

Name Type Description Content Content

Content object containing raw bytes and metadata.

Raises:

Type Description Exception

Retrieval-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object\n
"},{"location":"html/","title":"Html","text":""},{"location":"html/#omniread.html","title":"omniread.html","text":"

HTML format implementation for OmniRead.

"},{"location":"html/#omniread.html--summary","title":"Summary","text":"

This package provides HTML-specific implementations of the core OmniRead contracts defined in omniread.core.

It includes:

  - HTML parsers that interpret HTML content
  - HTML scrapers that retrieve HTML documents

This package:

  - Implements, but does not redefine, core contracts
  - May contain HTML-specific behavior and edge-case handling
  - Produces canonical content models defined in omniread.core.content

Consumers should depend on omniread.core interfaces wherever possible and use this package only when HTML-specific behavior is required.

"},{"location":"html/#omniread.html--public-api","title":"Public API","text":"
HTMLScraper\nHTMLParser\n
"},{"location":"html/#omniread.html-classes","title":"Classes","text":""},{"location":"html/#omniread.html.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- Extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Concrete parsers must explicitly define the return type\n

Guarantees:

- Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, and provides pure helper utilities for common HTML structures\n

Constraints:

- Concrete subclasses must define the output type `T` and implement the `parse()` method\n

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"html/#omniread.html.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/#omniread.html.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"html/#omniread.html.HTMLParser-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n
"},{"location":"html/#omniread.html.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Name Type Description str str

Flattened, whitespace-normalized text content.
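One way to read flattened, whitespace-normalized text is sketched below. This is an illustration under assumptions, not the library's implementation: `flatten_text` is a hypothetical helper that takes the text nodes as plain strings instead of a BeautifulSoup `Tag`.

```python
from typing import Iterable


def flatten_text(nodes: Iterable[str], separator: str = ' ') -> str:
    # Collapse runs of whitespace inside each text node to a single space.
    cleaned = (' '.join(node.split()) for node in nodes)
    # Drop nodes that were whitespace only, then join the rest with the separator.
    return separator.join(text for text in cleaned if text)


# Usage: three text nodes, one of them empty after normalization.
flat = flatten_text(['  Hello ', '   ', 'big   world '])
piped = flatten_text(['  Hello ', '   ', 'big   world '], separator=' | ')
```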

"},{"location":"html/#omniread.html.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

Optional[str]: The value of the href attribute, or None if absent.

"},{"location":"html/#omniread.html.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

Type Description dict[str, Any]

dict[str, Any]: Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document\n- This includes the document title and `<meta>` tag name/property \u2192 content mappings\n
"},{"location":"html/#omniread.html.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

list[list[str]]: A list of rows, where each row is a list of cell text values.

"},{"location":"html/#omniread.html.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- Retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only: it uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses\n

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Client | None

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"html/#omniread.html.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response's Content-Type header is missing or does not indicate HTML content.

"},{"location":"html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.
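The check described above can be sketched roughly as follows. This version is an assumption-laden illustration: it takes a plain header mapping instead of an `httpx.Response`, and it treats only `text/html` as HTML, which may be stricter or looser than the library's actual rule.

```python
from typing import Mapping


def validate_content_type(headers: Mapping[str, str]) -> None:
    # A plain dict is case-sensitive, unlike real HTTP header containers.
    content_type = headers.get('Content-Type', '')
    # Ignore parameters such as '; charset=utf-8'.
    media_type = content_type.split(';', 1)[0].strip().lower()
    if media_type != 'text/html':
        raise ValueError(f'Expected an HTML response, got Content-Type: {content_type!r}')


# Usage: an HTML response passes silently.
validate_content_type({'Content-Type': 'text/html; charset=utf-8'})
```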

"},{"location":"html/parser/","title":"Parser","text":""},{"location":"html/parser/#omniread.html.parser","title":"omniread.html.parser","text":"

HTML parser base implementations for OmniRead.

"},{"location":"html/parser/#omniread.html.parser--summary","title":"Summary","text":"

This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.

It supplies:

  - Content-type enforcement for HTML inputs
  - BeautifulSoup initialization and lifecycle management
  - Common helper methods for extracting structured data from HTML elements

Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.

"},{"location":"html/parser/#omniread.html.parser-classes","title":"Classes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- Extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Concrete parsers must explicitly define the return type\n

Guarantees:

- Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, and provides pure helper utilities for common HTML structures\n

Constraints:

- Concrete subclasses must define the output type `T` and implement the `parse()` method\n

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"html/parser/#omniread.html.parser.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"html/parser/#omniread.html.parser.HTMLParser-functions","title":"Functions","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n
"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Name Type Description str str

Flattened, whitespace-normalized text content.

"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

Optional[str]: The value of the href attribute, or None if absent.

"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

Type Description dict[str, Any]

dict[str, Any]: Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document\n- This includes the document title and `<meta>` tag name/property \u2192 content mappings\n
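The kind of extraction described here can be sketched with the standard library alone. Everything below is hypothetical: `extract_meta` and `MetaCollector` are illustration names, the real method works on the parser's BeautifulSoup tree, and the shape of the returned dictionary (a `title` key plus a `meta` mapping) is an assumption.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Any, Optional


class MetaCollector(StdlibHTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.title: Optional[str] = None
        self.meta: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta':
            # Key each entry by its name attribute, falling back to property.
            key = attributes.get('name') or attributes.get('property')
            if key and attributes.get('content') is not None:
                self.meta[key] = attributes['content']

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or '') + data


def extract_meta(html: str) -> dict[str, Any]:
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    title = collector.title.strip() if collector.title else None
    return {'title': title, 'meta': collector.meta}
```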
"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

list[list[str]]: A list of rows, where each row is a list of cell text values.
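To make the row and cell structure concrete, here is a standard-library sketch of the same idea. It is not the library's code: `table_to_rows` and `TableCollector` are hypothetical names, the input is an HTML string instead of a BeautifulSoup `Tag`, and nested tables are not handled.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class TableCollector(StdlibHTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Join the fragments and collapse whitespace in the cell text.
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_rows(html: str) -> list[list[str]]:
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


# Usage: header cells and data cells are treated alike.
rows = table_to_rows(
    '<table><tr><th>Name</th><th>Type</th></tr>'
    '<tr><td> raw </td><td>bytes</td></tr></table>'
)
```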

"},{"location":"html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"html/scraper/","title":"Scraper","text":""},{"location":"html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":"

HTML scraping implementation for OmniRead.

"},{"location":"html/scraper/#omniread.html.scraper--summary","title":"Summary","text":"

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for:

  - Fetching raw HTML bytes over HTTP(S)
  - Validating response content type
  - Attaching HTTP metadata to the returned content

This scraper is not responsible for:

  - Parsing or interpreting HTML
  - Retrying failed requests
  - Managing crawl policies or rate limiting

"},{"location":"html/scraper/#omniread.html.scraper-classes","title":"Classes","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- Retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only: it uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses\n

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Client | None

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response's Content-Type header is missing or does not indicate HTML content.

"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.

"},{"location":"omniread/","title":"Omniread","text":""},{"location":"omniread/#omniread","title":"omniread","text":"

OmniRead \u2014 format-agnostic content acquisition and parsing framework.

"},{"location":"omniread/#omniread--summary","title":"Summary","text":"

OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

  1. Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
  2. Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
  3. Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure:

  - Clear boundaries between IO and interpretation
  - Replaceable implementations per format
  - Predictable, testable behavior

"},{"location":"omniread/#omniread--installation","title":"Installation","text":"

Install OmniRead using pip:

pip install omniread\n

Or with Poetry:

poetry add omniread\n
"},{"location":"omniread/#omniread--quick-start","title":"Quick start","text":"

HTML example:

from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -> str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n

PDF example:

from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -> str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n
"},{"location":"omniread/#omniread--public-api","title":"Public API","text":"

This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

Core:

  - Content
  - ContentType

HTML:

  - HTMLScraper
  - HTMLParser

PDF:

  - FileSystemPDFClient
  - PDFScraper
  - PDFParser

Core Philosophy: OmniRead is designed as a decoupled content engine:

  1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
  2. Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
  3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.

"},{"location":"omniread/#omniread-classes","title":"Classes","text":""},{"location":"omniread/#omniread.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between scrapers, parsers, and downstream consumers\n
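As a rough picture of this container, the sketch below rebuilds it as a plain dataclass from the signature shown above. It is a stand-in for illustration only: the real class ships with omniread, and the two-member `ContentType` here is trimmed down from the documented enum.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    # Trimmed stand-in; the documented enum also lists JSON and XML.
    HTML = 'text/html'
    PDF = 'application/pdf'


@dataclass
class Content:
    raw: bytes                                     # bytes exactly as retrieved
    source: str                                    # URL, file path, or logical name
    content_type: Optional[ContentType] = None     # known media type, if any
    metadata: Optional[Mapping[str, Any]] = None   # implementation-defined extras


# Usage: what a scraper might hand to a parser.
page = Content(
    raw=b'<html><title>Example</title></html>',
    source='https://example.com',
    content_type=ContentType.HTML,
    metadata={'status_code': 200},
)
bare = Content(raw=b'', source='unknown')
```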
"},{"location":"omniread/#omniread.Content-attributes","title":"Attributes","text":""},{"location":"omniread/#omniread.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"omniread/#omniread.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"omniread/#omniread.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"omniread/#omniread.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"omniread/#omniread.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n
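The routing role mentioned above might look like the following sketch. The enum values are copied from the attribute listing in this reference; the `HANDLERS` table and the `route` function are hypothetical and exist only to show a lookup keyed on the enum.

```python
from enum import Enum


class ContentType(str, Enum):
    # Values copied from the attribute listing in this reference.
    HTML = 'text/html'
    JSON = 'application/json'
    PDF = 'application/pdf'
    XML = 'application/xml'


# Hypothetical routing table: content type -> name of the handler to use.
HANDLERS = {
    ContentType.HTML: 'html-parser',
    ContentType.PDF: 'pdf-parser',
}


def route(mime: str) -> str:
    # Drop parameters such as '; charset=utf-8' before the enum lookup.
    media_type = mime.split(';', 1)[0].strip().lower()
    try:
        content_type = ContentType(media_type)
    except ValueError:
        return 'unsupported'  # not a member of the enum
    return HANDLERS.get(content_type, 'unsupported')
```

Because the enum mixes in `str`, a member also compares equal to its MIME string, which keeps lookups against raw header values simple.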
"},{"location":"omniread/#omniread.ContentType-attributes","title":"Attributes","text":""},{"location":"omniread/#omniread.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"omniread/#omniread.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"omniread/#omniread.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"omniread/#omniread.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"omniread/#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns their raw binary contents\n
"},{"location":"omniread/#omniread.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"omniread/#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.
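The documented error behavior can be sketched as a small function. `read_pdf_bytes` is a hypothetical name used for illustration; the real logic lives in `FileSystemPDFClient.fetch` and may differ in detail.

```python
import tempfile
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    if not path.is_file():
        # For example, a directory was passed instead of a file.
        raise ValueError(f'Path is not a file: {path}')
    return path.read_bytes()


# Usage: write a tiny placeholder file and read it back unchanged.
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = Path(tmp) / 'document.pdf'
    pdf_path.write_bytes(b'%PDF-1.7')
    data = read_pdf_bytes(pdf_path)
```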

"},{"location":"omniread/#omniread.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- Extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Concrete parsers must explicitly define the return type\n

Guarantees:

- Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, and provides pure helper utilities for common HTML structures\n

Constraints:

- Concrete subclasses must define the output type `T` and implement the `parse()` method\n

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"omniread/#omniread.HTMLParser-attributes","title":"Attributes","text":""},{"location":"omniread/#omniread.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"omniread/#omniread.HTMLParser-functions","title":"Functions","text":""},{"location":"omniread/#omniread.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n
"},{"location":"omniread/#omniread.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Name Type Description str str

Flattened, whitespace-normalized text content.

"},{"location":"omniread/#omniread.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

Optional[str]: The value of the href attribute, or None if absent.

"},{"location":"omniread/#omniread.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

Type Description dict[str, Any]

dict[str, Any]: Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document\n- This includes the document title and `<meta>` tag name/property \u2192 content mappings\n
"},{"location":"omniread/#omniread.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

list[list[str]]: A list of rows, where each row is a list of cell text values.

"},{"location":"omniread/#omniread.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/#omniread.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- Retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only: it uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses\n

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Client | None

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"omniread/#omniread.HTMLScraper-functions","title":"Functions","text":""},{"location":"omniread/#omniread.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response's Content-Type header is missing or does not indicate HTML content.

"},{"location":"omniread/#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.

"},{"location":"omniread/#omniread.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n

Constraints:

- Concrete implementations must define the output type `T` and implement the `parse()` method\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/#omniread.PDFParser-attributes","title":"Attributes","text":""},{"location":"omniread/#omniread.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"omniread/#omniread.PDFParser-functions","title":"Functions","text":""},{"location":"omniread/#omniread.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n
"},{"location":"omniread/#omniread.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/#omniread.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n

Constraints:

- The scraper does not perform parsing or interpretation and does not assume a specific storage backend\n

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"omniread/#omniread.PDFScraper-functions","title":"Functions","text":""},{"location":"omniread/#omniread.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.
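The delegation described above can be sketched as follows. All names here are stand-ins chosen for the example: `InMemoryPDFClient` is a hypothetical client, `PDFScraperSketch` is not the library class, and the simplified `Content` stores the content type as a plain string.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class Content:
    # Simplified stand-in for omniread's Content.
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


class InMemoryPDFClient:
    # Hypothetical client: serves PDF bytes from a dict instead of real storage.
    def __init__(self, files: Mapping[str, bytes]) -> None:
        self._files = files

    def fetch(self, source: str) -> bytes:
        return self._files[source]  # a missing key surfaces as KeyError


class PDFScraperSketch:
    def __init__(self, *, client: Any) -> None:
        self._client = client

    def fetch(self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        raw = self._client.fetch(source)  # byte retrieval is delegated to the client
        return Content(
            raw=raw,
            source=str(source),
            content_type='application/pdf',
            metadata=dict(metadata) if metadata else None,
        )


# Usage: the scraper does not care where the client gets its bytes.
scraper = PDFScraperSketch(client=InMemoryPDFClient({'report': b'%PDF-1.7'}))
content = scraper.fetch('report', metadata={'origin': 'demo'})
```

Swapping the client is enough to change the storage backend; the scraper itself stays the same.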

"},{"location":"omniread/core/","title":"Core","text":""},{"location":"omniread/core/#omniread.core","title":"omniread.core","text":"

Core domain contracts for OmniRead.

"},{"location":"omniread/core/#omniread.core--summary","title":"Summary","text":"

This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).

Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.

Submodules:

  - content: Canonical content models and enums
  - parser: Abstract parsing contracts
  - scraper: Abstract scraping contracts

Format-specific behavior must not be introduced at this layer.

"},{"location":"omniread/core/#omniread.core--public-api","title":"Public API","text":"
Content\nContentType\n
"},{"location":"omniread/core/#omniread.core-classes","title":"Classes","text":""},{"location":"omniread/core/#omniread.core.BaseParser","title":"BaseParser","text":"
BaseParser(content: Content)\n

Bases: ABC, Generic[T]

Base interface for all parsers.

Notes

Guarantees:

- A parser is a self-contained object that owns the Content it is responsible for interpreting\n- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n

Responsibilities:

- Implementations must declare supported content types via `supported_types`\n- Implementations must raise parsing-specific exceptions from `parse()`\n- Implementations must remain deterministic for a given input\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/core/#omniread.core.BaseParser-attributes","title":"Attributes","text":""},{"location":"omniread/core/#omniread.core.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: Set[ContentType] = set()\n

Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.
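The rule that an empty set means content-type agnostic, combined with early validation, can be sketched like this. The classes below are stand-ins written for illustration, not the library's source; content types are plain strings here, and how the real base class treats a missing content type is not covered by this reference.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Optional, Set, TypeVar

T = TypeVar('T')


@dataclass
class Content:
    # Simplified stand-in for omniread's Content.
    raw: bytes
    source: str
    content_type: Optional[str] = None


class BaseParserSketch(ABC, Generic[T]):
    supported_types: Set[str] = set()  # empty set: accept any content type

    def __init__(self, content: Content) -> None:
        self.content = content
        if not self.supports():
            # Early validation: reject incompatible content at construction time.
            raise ValueError(f'Unsupported content type: {content.content_type!r}')

    def supports(self) -> bool:
        if not self.supported_types:
            return True
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class ByteCountParser(BaseParserSketch[int]):
    # Content-type agnostic: inherits the empty supported_types set.
    def parse(self) -> int:
        return len(self.content.raw)


class HTMLOnlyParser(BaseParserSketch[str]):
    supported_types = {'text/html'}

    def parse(self) -> str:
        return self.content.raw.decode('utf-8')
```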

"},{"location":"omniread/core/#omniread.core.BaseParser-functions","title":"Functions","text":""},{"location":"omniread/core/#omniread.core.BaseParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse the owned content into structured output.

Returns:

Name Type Description T T

Parsed, structured representation.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully consume the provided content and return a deterministic, structured output\n
"},{"location":"omniread/core/#omniread.core.BaseParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.
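
The parser contract described above can be sketched without installing the library. This is a minimal sketch, not OmniRead's actual source: `Content`, `ContentType`, and `BaseParser` are simplified stand-ins, the `content` attribute name is an assumption, treating an unknown (`None`) content type as unsupported for a typed parser is an assumption, and `JSONKeysParser` is a hypothetical subclass used only for illustration.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Generic, List, Mapping, Optional, Set, TypeVar

T = TypeVar('T')


class ContentType(str, Enum):
    # Stand-in for the documented enum (subset of members).
    HTML = 'text/html'
    JSON = 'application/json'


@dataclass
class Content:
    # Stand-in for the documented dataclass.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseParser(ABC, Generic[T]):
    # An empty set means the parser is content-type agnostic.
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        self.content = content  # attribute name is an assumption
        # Early validation: reject incompatible content at construction time.
        if not self.supports():
            raise ValueError(f'Unsupported content type: {content.content_type!r}')

    def supports(self) -> bool:
        if not self.supported_types:
            return True
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class JSONKeysParser(BaseParser[List[str]]):
    # Hypothetical parser: returns the sorted top-level keys of a JSON object.
    supported_types = {ContentType.JSON}

    def parse(self) -> List[str]:
        return sorted(json.loads(self.content.raw))
```

Because validation happens in the constructor, a caller finds out about a mismatched content type before `parse()` is ever invoked, which is what the early-validation guarantee refers to.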

"},{"location":"omniread/core/#omniread.core.BaseScraper","title":"BaseScraper","text":"

Bases: ABC

Base interface for all scrapers.

Notes

Responsibilities:

- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n- Scrapers define how content is obtained, not what the content means\n- Implementations may vary in transport mechanism, authentication strategy, and retry and backoff behavior\n

Constraints:

- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser\n
"},{"location":"omniread/core/#omniread.core.BaseScraper-functions","title":"Functions","text":""},{"location":"omniread/core/#omniread.core.BaseScraper.fetch","title":"fetch abstractmethod","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch raw content from the given source.

Parameters:

Name Type Description Default source str

Location identifier (URL, file path, S3 URI, etc.)

required metadata Optional[Mapping[str, Any]]

Optional hints for the scraper (headers, auth, etc.)

None

Returns:

Name Type Description Content Content

Content object containing raw bytes and metadata.

Raises:

Type Description Exception

Retrieval-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object\n
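
To illustrate the `fetch` contract, here is a minimal sketch of a scraper that serves bytes from an in-memory mapping. `Content` and `BaseScraper` are simplified stand-ins, and `InMemoryScraper` and the `memory://` source names are hypothetical; raising `LookupError` for an unknown source is an assumption, since the contract leaves retrieval errors to the implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


@dataclass
class Content:
    # Stand-in for the documented dataclass (content_type simplified to str).
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseScraper(ABC):
    @abstractmethod
    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        ...


class InMemoryScraper(BaseScraper):
    # Hypothetical scraper: looks up bytes by source name and never interprets them.
    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents: Dict[str, bytes] = dict(documents)

    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        if source not in self._documents:
            raise LookupError(f'Unknown source: {source}')
        # Raw bytes are passed through untouched; only context is attached.
        return Content(
            raw=self._documents[source],
            source=source,
            metadata=dict(metadata) if metadata is not None else None,
        )
```

A scraper like this is handy in tests, because parsers can be exercised against fixed bytes without any network access.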
"},{"location":"omniread/core/#omniread.core.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between scrapers, parsers, and downstream consumers\n
"},{"location":"omniread/core/#omniread.core.Content-attributes","title":"Attributes","text":""},{"location":"omniread/core/#omniread.core.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"omniread/core/#omniread.core.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"omniread/core/#omniread.core.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"omniread/core/#omniread.core.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"omniread/core/#omniread.core.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n
"},{"location":"omniread/core/#omniread.core.ContentType-attributes","title":"Attributes","text":""},{"location":"omniread/core/#omniread.core.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"omniread/core/#omniread.core.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"omniread/core/#omniread.core.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"omniread/core/#omniread.core.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.
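
Since the enum is described as a routing aid, the following sketch shows one way such routing could look. The enum members and values mirror the ones documented above; the `ROUTES` table, the `describe_*` handlers, and `route` are hypothetical names introduced only for this example.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class ContentType(str, Enum):
    HTML = 'text/html'
    JSON = 'application/json'
    PDF = 'application/pdf'
    XML = 'application/xml'


def describe_html(raw: bytes) -> str:
    return f'HTML payload of {len(raw)} bytes'


def describe_pdf(raw: bytes) -> str:
    return f'PDF payload of {len(raw)} bytes'


# Hypothetical routing table keyed by enum member.
ROUTES: Dict[ContentType, Callable[[bytes], str]] = {
    ContentType.HTML: describe_html,
    ContentType.PDF: describe_pdf,
}


def route(content_type: Optional[ContentType], raw: bytes) -> str:
    handler = ROUTES.get(content_type) if content_type is not None else None
    if handler is None:
        raise ValueError(f'No handler registered for {content_type!r}')
    return handler(raw)


# A MIME string converts to a member by value lookup.
html_type = ContentType('text/html')
```

Because the enum mixes in `str`, a member compares equal to its MIME string, but dictionary lookups are safest when keyed by the member itself, so the sketch converts strings with `ContentType(...)` first.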

"},{"location":"omniread/core/content/","title":"Content","text":""},{"location":"omniread/core/content/#omniread.core.content","title":"omniread.core.content","text":"

Canonical content models for OmniRead.

"},{"location":"omniread/core/content/#omniread.core.content--summary","title":"Summary","text":"

This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.

The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.

"},{"location":"omniread/core/content/#omniread.core.content-classes","title":"Classes","text":""},{"location":"omniread/core/content/#omniread.core.content.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n- This class is the primary exchange format between scrapers, parsers, and downstream consumers\n
"},{"location":"omniread/core/content/#omniread.core.content.Content-attributes","title":"Attributes","text":""},{"location":"omniread/core/content/#omniread.core.content.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"omniread/core/content/#omniread.core.content.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"omniread/core/content/#omniread.core.content.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"omniread/core/content/#omniread.core.content.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"omniread/core/content/#omniread.core.content.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the content source\n- It is primarily used for routing content to the appropriate parser or downstream consumer\n
"},{"location":"omniread/core/content/#omniread.core.content.ContentType-attributes","title":"Attributes","text":""},{"location":"omniread/core/content/#omniread.core.content.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"omniread/core/content/#omniread.core.content.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"omniread/core/content/#omniread.core.content.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"omniread/core/content/#omniread.core.content.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"omniread/core/parser/","title":"Parser","text":""},{"location":"omniread/core/parser/#omniread.core.parser","title":"omniread.core.parser","text":"

Abstract parsing contracts for OmniRead.

"},{"location":"omniread/core/parser/#omniread.core.parser--summary","title":"Summary","text":"

This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.

Parsers are responsible for:

- Interpreting a single Content instance
- Validating compatibility with the content type
- Producing a structured output suitable for downstream consumers

Parsers are not responsible for:

- Fetching or acquiring content
- Performing retries or error recovery
- Managing multiple content sources

"},{"location":"omniread/core/parser/#omniread.core.parser-classes","title":"Classes","text":""},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"
BaseParser(content: Content)\n

Bases: ABC, Generic[T]

Base interface for all parsers.

Notes

Guarantees:

- A parser is a self-contained object that owns the Content it is responsible for interpreting\n- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n

Responsibilities:

- Implementations must declare supported content types via `supported_types`\n- Implementations must raise parsing-specific exceptions from `parse()`\n- Implementations must remain deterministic for a given input\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser-attributes","title":"Attributes","text":""},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: Set[ContentType] = set()\n

Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.

"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser-functions","title":"Functions","text":""},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse the owned content into structured output.

Returns:

Name Type Description T T

Parsed, structured representation.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully consume the provided content and return a deterministic, structured output\n
"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/core/scraper/","title":"Scraper","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":"

Abstract scraping contracts for OmniRead.

"},{"location":"omniread/core/scraper/#omniread.core.scraper--summary","title":"Summary","text":"

This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.

Scrapers are responsible for:

- Locating and retrieving raw content bytes
- Attaching minimal contextual metadata
- Returning normalized Content objects

Scrapers are explicitly NOT responsible for:

- Parsing or interpreting content
- Inferring structure or semantics
- Performing content-type specific processing

All interpretation must be delegated to parsers.

"},{"location":"omniread/core/scraper/#omniread.core.scraper-classes","title":"Classes","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":"

Bases: ABC

Base interface for all scrapers.

Notes

Responsibilities:

- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n- Scrapers define how content is obtained, not what the content means\n- Implementations may vary in transport mechanism, authentication strategy, and retry and backoff behavior\n

Constraints:

- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser\n
"},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper-functions","title":"Functions","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch abstractmethod","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch raw content from the given source.

Parameters:

Name Type Description Default source str

Location identifier (URL, file path, S3 URI, etc.)

required metadata Optional[Mapping[str, Any]]

Optional hints for the scraper (headers, auth, etc.)

None

Returns:

Name Type Description Content Content

Content object containing raw bytes and metadata.

Raises:

Type Description Exception

Retrieval-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object\n
"},{"location":"omniread/html/","title":"Html","text":""},{"location":"omniread/html/#omniread.html","title":"omniread.html","text":"

HTML format implementation for OmniRead.

"},{"location":"omniread/html/#omniread.html--summary","title":"Summary","text":"

This package provides HTML-specific implementations of the core OmniRead contracts defined in omniread.core.

It includes:

- HTML parsers that interpret HTML content
- HTML scrapers that retrieve HTML documents

This package:

- Implements, but does not redefine, core contracts
- May contain HTML-specific behavior and edge-case handling
- Produces canonical content models defined in omniread.core.content

Consumers should depend on omniread.core interfaces wherever possible and use this package only when HTML-specific behavior is required.

"},{"location":"omniread/html/#omniread.html--public-api","title":"Public API","text":"
HTMLScraper\nHTMLParser\n
"},{"location":"omniread/html/#omniread.html-classes","title":"Classes","text":""},{"location":"omniread/html/#omniread.html.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n

Guarantees:

- Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, and provides pure helper utilities for common HTML structures\n

Constraints:

- Concrete subclasses must define the output type `T` and implement the `parse()` method\n

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"omniread/html/#omniread.html.HTMLParser-attributes","title":"Attributes","text":""},{"location":"omniread/html/#omniread.html.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"omniread/html/#omniread.html.HTMLParser-functions","title":"Functions","text":""},{"location":"omniread/html/#omniread.html.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n
"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Name Type Description str str

Flattened, whitespace-normalized text content.

"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

Optional[str]: The value of the href attribute, or None if absent.

"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

Type Description dict[str, Any]

dict[str, Any]: Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document\n- This includes the document title and `<meta>` tag name/property \u2192 content mappings\n
"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

list[list[str]]: A list of rows, where each row is a list of cell text values.

"},{"location":"omniread/html/#omniread.html.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses\n

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Client | None

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"omniread/html/#omniread.html.HTMLScraper-functions","title":"Functions","text":""},{"location":"omniread/html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

"},{"location":"omniread/html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.
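
The check described here can be approximated with the standard library alone. This is a sketch under stated assumptions, not the library's implementation: the documented method takes an `httpx.Response`, whereas the hypothetical `validate_html_content_type` below takes a plain header mapping, and accepting only `text/html` (ignoring parameters such as the charset) is an assumption.

```python
from typing import Mapping


def validate_html_content_type(headers: Mapping[str, str]) -> None:
    # HTTP header names are case-insensitive, so compare them lowercased.
    value = next(
        (v for k, v in headers.items() if k.lower() == 'content-type'),
        None,
    )
    if value is None:
        raise ValueError('Missing Content-Type header')
    # Drop parameters such as the charset and normalize the media type.
    mime = value.split(';', 1)[0].strip().lower()
    if mime != 'text/html':
        raise ValueError(f'Expected text/html, got {mime!r}')
```

Returning `None` on success and raising on failure mirrors the documented signature, so the caller can invoke the check right after the request and let the error propagate.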

"},{"location":"omniread/html/parser/","title":"Parser","text":""},{"location":"omniread/html/parser/#omniread.html.parser","title":"omniread.html.parser","text":"

HTML parser base implementations for OmniRead.

"},{"location":"omniread/html/parser/#omniread.html.parser--summary","title":"Summary","text":"

This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.

It supplies:

- Content-type enforcement for HTML inputs
- BeautifulSoup initialization and lifecycle management
- Common helper methods for extracting structured data from HTML elements

Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.

"},{"location":"omniread/html/parser/#omniread.html.parser-classes","title":"Classes","text":""},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n

Guarantees:

- Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, and provides pure helper utilities for common HTML structures\n

Constraints:

- Concrete subclasses must define the output type `T` and implement the `parse()` method\n

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser-attributes","title":"Attributes","text":""},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser-functions","title":"Functions","text":""},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a deterministic, structured output\n
"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Name Type Description str str

Flattened, whitespace-normalized text content.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

Optional[str]: The value of the href attribute, or None if absent.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

Type Description dict[str, Any]

dict[str, Any]: Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document\n- This includes the document title and `<meta>` tag name/property \u2192 content mappings\n
"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

list[list[str]]: A list of rows, where each row is a list of cell text values.

"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/html/scraper/","title":"Scraper","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":"

HTML scraping implementation for OmniRead.

"},{"location":"omniread/html/scraper/#omniread.html.scraper--summary","title":"Summary","text":"

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for:

- Fetching raw HTML bytes over HTTP(S)
- Validating response content type
- Attaching HTTP metadata to the returned content

This scraper is not responsible for:

- Parsing or interpreting HTML
- Retrying failed requests
- Managing crawl policies or rate limiting

"},{"location":"omniread/html/scraper/#omniread.html.scraper-classes","title":"Classes","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, and preserves HTTP response metadata\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or handle non-HTML responses\n

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Client | None

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper-functions","title":"Functions","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.

"},{"location":"omniread/pdf/","title":"Pdf","text":""},{"location":"omniread/pdf/#omniread.pdf","title":"omniread.pdf","text":"

PDF format implementation for OmniRead.

"},{"location":"omniread/pdf/#omniread.pdf--summary","title":"Summary","text":"

This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.

Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:

- PDF clients for acquiring raw PDF data
- PDF scrapers that coordinate client access
- PDF parsers that extract structured content from PDF binaries

Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.

"},{"location":"omniread/pdf/#omniread.pdf--public-api","title":"Public API","text":"
FileSystemPDFClient\nPDFScraper\nPDFParser\n
"},{"location":"omniread/pdf/#omniread.pdf-classes","title":"Classes","text":""},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns their raw binary contents\n
"},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.
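
The documented error behavior can be reproduced with `pathlib` alone. The function below is a sketch of what such a client might do internally, not the library's code; `read_pdf_bytes` is a hypothetical name.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    # Missing path: mirror the documented FileNotFoundError.
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    # Existing path that is not a regular file (for example a directory).
    if not path.is_file():
        raise ValueError(f'Path is not a file: {path}')
    # Return the bytes untouched; interpretation is left to a parser.
    return path.read_bytes()
```

Checking existence before the file test keeps the two failure modes distinct, so callers can tell a wrong path apart from a path that points at a directory.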

"},{"location":"omniread/pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n

Constraints:

- Concrete implementations must define the output type `T` and implement the `parse()` method\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/pdf/#omniread.pdf.PDFParser-attributes","title":"Attributes","text":""},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"omniread/pdf/#omniread.pdf.PDFParser-functions","title":"Functions","text":""},{"location":"omniread/pdf/#omniread.pdf.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n
"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n

Constraints:

- The scraper does not perform parsing or interpretation and does not assume a specific storage backend\n

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper-functions","title":"Functions","text":""},{"location":"omniread/pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.
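
The delegation described above can be sketched with stand-ins. All names here (`Content` fields, `StaticClient`, `SketchPDFScraper`) are hypothetical, and the content type is a plain string for brevity; this is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class Content:
    # Stand-in container; field names are assumptions.
    raw: bytes
    source: str
    content_type: str
    metadata: Optional[Mapping[str, Any]] = None


class StaticClient:
    # Hypothetical client that serves PDFs from a mapping instead of a real store.
    def __init__(self, blobs: Mapping[str, bytes]) -> None:
        self._blobs = blobs

    def fetch(self, source: Any) -> bytes:
        return self._blobs[source]  # KeyError stands in for a retrieval error


class SketchPDFScraper:
    def __init__(self, *, client: StaticClient) -> None:
        self._client = client

    def fetch(self, source: Any, *,
              metadata: Optional[Mapping[str, Any]] = None) -> Content:
        raw = self._client.fetch(source)  # client errors propagate unchanged
        # Wrap the bytes without inspecting them; metadata is passed through as-is.
        return Content(raw=raw, source=str(source),
                       content_type='application/pdf', metadata=metadata)
```

Because the scraper only calls `client.fetch`, swapping the backing store needs no change to the scraper itself.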

"},{"location":"omniread/pdf/client/","title":"Client","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":"

PDF client abstractions for OmniRead.

"},{"location":"omniread/pdf/client/#omniread.pdf.client--summary","title":"Summary","text":"

This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.

Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.

Typical backing stores include local filesystems, object storage (S3, GCS, etc.), and network file systems.

"},{"location":"omniread/pdf/client/#omniread.pdf.client-classes","title":"Classes","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":"

Bases: ABC

Abstract client responsible for retrieving PDF bytes from a specific backing store (filesystem, S3, FTP, etc.).

Notes

Responsibilities:

- Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure\n
"},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient-functions","title":"Functions","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch abstractmethod","text":"
fetch(source: Any) -> bytes\n

Fetch raw PDF bytes from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF location, such as a file path, object storage key, or remote reference.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description Exception

Retrieval-specific errors defined by the implementation.
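
A custom backing store could follow the same contract. The sketch below uses a stand-in abstract base rather than the real `BasePDFClient`; `SketchBaseClient`, `DictPDFClient`, and the choice of `LookupError` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchBaseClient(ABC):
    # Stand-in for BasePDFClient with the same single abstract method.
    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        ...


class DictPDFClient(SketchBaseClient):
    # Hypothetical backing store: a dict mapping keys to PDF bytes.
    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: Any) -> bytes:
        try:
            return self._store[source]
        except KeyError:
            # Retrieval-specific error chosen for this sketch.
            raise LookupError(f'No PDF stored under key: {source}') from None
```

The abstract base cannot be instantiated directly, so every concrete client has to supply its own `fetch`.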

"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns their raw binary contents\n
"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.

"},{"location":"omniread/pdf/parser/","title":"Parser","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":"

PDF parser base implementations for OmniRead.

"},{"location":"omniread/pdf/parser/#omniread.pdf.parser--summary","title":"Summary","text":"

This module defines the PDF-specific parser contract, extending the format-agnostic BaseParser with constraints appropriate for PDF content.

PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.

"},{"location":"omniread/pdf/parser/#omniread.pdf.parser-classes","title":"Classes","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n

Constraints:

- Concrete implementations must define the output type `T` and implement the `parse()` method\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser-attributes","title":"Attributes","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser-functions","title":"Functions","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n
"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"omniread/pdf/scraper/","title":"Scraper","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":"

PDF scraping implementation for OmniRead.

"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper--summary","title":"Summary","text":"

This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.

The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.

"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper-classes","title":"Classes","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n

Constraints:

- The scraper does not perform parsing or interpretation and does not assume a specific storage backend\n

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper-functions","title":"Functions","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.

"},{"location":"pdf/","title":"Pdf","text":""},{"location":"pdf/#omniread.pdf","title":"omniread.pdf","text":"

PDF format implementation for OmniRead.

"},{"location":"pdf/#omniread.pdf--summary","title":"Summary","text":"

This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.

Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes PDF clients for acquiring raw PDF data, PDF scrapers that coordinate client access, and PDF parsers that extract structured content from PDF binaries.

Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.

"},{"location":"pdf/#omniread.pdf--public-api","title":"Public API","text":"
FileSystemPDFClient\nPDFScraper\nPDFParser\n
"},{"location":"pdf/#omniread.pdf-classes","title":"Classes","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns their raw binary contents\n
"},{"location":"pdf/#omniread.pdf.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.

"},{"location":"pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n

Constraints:

- Concrete implementations must define the output type `T` and implement the `parse()` method\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"pdf/#omniread.pdf.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"pdf/#omniread.pdf.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n
"},{"location":"pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n

Constraints:

- The scraper does not perform parsing or interpretation and does not assume a specific storage backend\n

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"pdf/#omniread.pdf.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.

"},{"location":"pdf/client/","title":"Client","text":""},{"location":"pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":"

PDF client abstractions for OmniRead.

"},{"location":"pdf/client/#omniread.pdf.client--summary","title":"Summary","text":"

This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.

Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.

Typical backing stores include local filesystems, object storage (S3, GCS, etc.), and network file systems.

"},{"location":"pdf/client/#omniread.pdf.client-classes","title":"Classes","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":"

Bases: ABC

Abstract client responsible for retrieving PDF bytes from a specific backing store (filesystem, S3, FTP, etc.).

Notes

Responsibilities:

- Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure\n
"},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch abstractmethod","text":"
fetch(source: Any) -> bytes\n

Fetch raw PDF bytes from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF location, such as a file path, object storage key, or remote reference.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description Exception

Retrieval-specific errors defined by the implementation.

"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns their raw binary contents\n
"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.

"},{"location":"pdf/parser/","title":"Parser","text":""},{"location":"pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":"

PDF parser base implementations for OmniRead.

"},{"location":"pdf/parser/#omniread.pdf.parser--summary","title":"Summary","text":"

This module defines the PDF-specific parser contract, extending the format-agnostic BaseParser with constraints appropriate for PDF content.

PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.

"},{"location":"pdf/parser/#omniread.pdf.parser-classes","title":"Classes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n

Constraints:

- Concrete implementations must define the output type `T` and implement the `parse()` method\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and return a deterministic, structured output\n
"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"pdf/scraper/","title":"Scraper","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":"

PDF scraping implementation for OmniRead.

"},{"location":"pdf/scraper/#omniread.pdf.scraper--summary","title":"Summary","text":"

This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.

The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.

"},{"location":"pdf/scraper/#omniread.pdf.scraper-classes","title":"Classes","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into Content\n- Preserves caller-provided metadata\n

Constraints:

- The scraper does not perform parsing or interpretation and does not assume a specific storage backend\n

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.

"}]}