{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"omniread","text":""},{"location":"#omniread","title":"omniread","text":""},{"location":"#omniread--summary","title":"Summary","text":"

OmniRead \u2014 format-agnostic content acquisition and parsing framework.

OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

  1. Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
  2. Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
  3. Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure:

"},{"location":"#omniread--installation","title":"Installation","text":"

Install OmniRead using pip:

pip install omniread\n

Install OmniRead using Poetry:

poetry add omniread\n

"},{"location":"#omniread--quick-start","title":"Quick start","text":"Example

HTML example:

from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -> str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n

PDF example:

from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -> str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n

"},{"location":"#omniread--public-api","title":"Public API","text":"

This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

"},{"location":"#omniread--core-philosophy","title":"Core Philosophy","text":"

OmniRead is designed as a decoupled content engine:

  1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
  2. Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
  3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.
"},{"location":"#omniread-classes","title":"Classes","text":""},{"location":"#omniread.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with\n  minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n  parsers, and downstream consumers.\n
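
As a rough illustration of the signature above, the following sketch builds a stand-in dataclass with the same four documented fields. It does not import OmniRead; the stand-in classes and the sample values (`page`, `status_code`) are assumptions made for this example, and the real class may differ in details such as immutability.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    # Values mirror the documented enum members.
    HTML = 'text/html'
    PDF = 'application/pdf'
    JSON = 'application/json'
    XML = 'application/xml'


@dataclass
class Content:
    # Stand-in for omniread.Content, built only from the documented fields.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


# Only raw and source are required; the other two default to None.
page = Content(
    raw=b'<html><title>Hi</title></html>',
    source='https://example.com',
    content_type=ContentType.HTML,
    metadata={'status_code': 200},
)
minimal = Content(raw=b'', source='logical-name')
```

Scrapers would produce objects of this shape and parsers would consume them, which is what lets either side be replaced independently.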
"},{"location":"#omniread.Content-attributes","title":"Attributes","text":""},{"location":"#omniread.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"#omniread.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"#omniread.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"#omniread.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"#omniread.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the\n  content source.\n- It is primarily used for routing content to the appropriate\n  parser or downstream consumer.\n
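
Because the enum mixes in `str`, its members compare equal to plain MIME strings, which makes routing straightforward. The sketch below redefines the enum locally and adds a hypothetical helper, `infer_content_type`, that is not part of OmniRead; it only shows one way a header value might be mapped to a member.

```python
from enum import Enum
from typing import Optional


class ContentType(str, Enum):
    # Local copy of the documented members, for illustration only.
    HTML = 'text/html'
    JSON = 'application/json'
    PDF = 'application/pdf'
    XML = 'application/xml'


def infer_content_type(header_value: str) -> Optional[ContentType]:
    # Hypothetical helper: drop parameters such as charset, then look up
    # the bare media type. Unknown types map to None.
    media_type = header_value.split(';', 1)[0].strip().lower()
    try:
        return ContentType(media_type)
    except ValueError:
        return None


html_type = infer_content_type('text/html; charset=utf-8')
unknown_type = infer_content_type('image/png')
```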
"},{"location":"#omniread.ContentType-attributes","title":"Attributes","text":""},{"location":"#omniread.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"#omniread.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"#omniread.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"#omniread.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns\n  their raw binary contents.\n
"},{"location":"#omniread.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

path (Path, required): Filesystem path to the PDF file.

Returns:

bytes: Raw PDF bytes.

Raises:

FileNotFoundError: If the path does not exist.

ValueError: If the path exists but is not a file.
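
The documented error behavior can be sketched with the standard library alone. The function `read_pdf_bytes` below is a hypothetical stand-in for `FileSystemPDFClient.fetch`, not the library's code; it assumes the checks happen in the order listed above.

```python
import tempfile
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    # Hypothetical stand-in for FileSystemPDFClient.fetch.
    if not path.exists():
        raise FileNotFoundError(f'PDF file not found: {path}')
    if not path.is_file():
        raise ValueError(f'Path is not a file: {path}')
    return path.read_bytes()


def error_name(path: Path) -> str:
    # Report which exception, if any, a read attempt raises.
    try:
        read_pdf_bytes(path)
    except (FileNotFoundError, ValueError) as exc:
        return type(exc).__name__
    return 'ok'


with tempfile.TemporaryDirectory() as tmp:
    pdf_path = Path(tmp) / 'document.pdf'
    pdf_path.write_bytes(b'%PDF-1.4 demo')
    data = read_pdf_bytes(pdf_path)
    directory_result = error_name(Path(tmp))
    missing_result = error_name(Path(tmp) / 'missing.pdf')
```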

"},{"location":"#omniread.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- This class extends the core `BaseParser` with HTML-specific behavior,\n  including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Provides reusable helpers for HTML extraction. Concrete parsers must\n  explicitly define the return type.\n

Guarantees:

- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n

Constraints:

- Concrete subclasses must define the output type `T` and implement\n  the `parse()` method.\n

Initialize the HTML parser.

Parameters:

content (Content, required): HTML content to be parsed.

features (str, default 'html.parser'): BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

Raises:

ValueError: If the content is empty or not valid HTML.

"},{"location":"#omniread.HTMLParser-attributes","title":"Attributes","text":""},{"location":"#omniread.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"#omniread.HTMLParser-functions","title":"Functions","text":""},{"location":"#omniread.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

T: Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a\n  deterministic, structured output.\n
"},{"location":"#omniread.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

div (Tag, required): BeautifulSoup tag representing a <div>.

separator (str, default ' '): String used to separate text nodes.

Returns:

str: Flattened, whitespace-normalized text content.

"},{"location":"#omniread.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

a (Tag, required): BeautifulSoup tag representing an anchor.

Returns:

Optional[str]: The value of the href attribute, or None if absent.

"},{"location":"#omniread.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

dict[str, Any]: Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document.\n- This includes the document title and `<meta>` tag name/property to\n  content mappings.\n
"},{"location":"#omniread.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

table (Tag, required): BeautifulSoup tag representing a <table>.

Returns:

list[list[str]]: A list of rows, where each row is a list of cell text values.

"},{"location":"#omniread.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

bool: True if the content type is supported; False otherwise.

"},{"location":"#omniread.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- This scraper retrieves HTML documents over HTTP(S) and returns\n  them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n  HTML content type, and preserves HTTP response metadata.\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff,\n  or handle non-HTML responses.\n

Initialize the HTML scraper.

Parameters:

client (Client | None, default None): Optional pre-configured httpx.Client. If omitted, a client is created internally.

timeout (float, default 15.0): Request timeout in seconds.

headers (Optional[Mapping[str, str]], default None): Optional default HTTP headers.

follow_redirects (bool, default True): Whether to follow HTTP redirects."},{"location":"#omniread.HTMLScraper-functions","title":"Functions","text":""},{"location":"#omniread.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

source (str, required): URL of the HTML document.

metadata (Optional[Mapping[str, Any]], default None): Optional metadata to be merged into the returned content.

Returns:

Content: A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

HTTPError: If the HTTP request fails.

ValueError: If the response is not valid HTML.

"},{"location":"#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

response (Response, required): HTTP response returned by httpx.

Raises:

ValueError: If the Content-Type header is missing or does not indicate HTML content.
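
A check of this kind might look like the sketch below. It works on a plain mapping of header names to values rather than an `httpx.Response`, and the function names are hypothetical. The exact set of media types the real method accepts is not stated here, so the sketch assumes only `text/html` passes.

```python
from typing import Mapping


def validate_html_content_type(headers: Mapping[str, str]) -> None:
    # A plain dict is case-sensitive, unlike HTTP header containers,
    # so this sketch expects a lower-case key.
    value = headers.get('content-type')
    if not value:
        raise ValueError('Missing Content-Type header')
    # Ignore parameters such as charset when comparing.
    media_type = value.split(';', 1)[0].strip().lower()
    if media_type != 'text/html':
        raise ValueError(f'Expected HTML content, got {media_type}')


def accepts(headers: Mapping[str, str]) -> bool:
    # Convenience wrapper that turns the exception into a boolean.
    try:
        validate_html_content_type(headers)
    except ValueError:
        return False
    return True
```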

"},{"location":"#omniread.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides\n  the extension point for implementing concrete PDF parsing strategies.\n

Constraints:

- Concrete implementations must define the output type `T` and\n  implement the `parse()` method.\n

Initialize the parser with content to be parsed.

Parameters:

content (Content, required): Content instance to be parsed.

Raises:

ValueError: If the content type is not supported by this parser.

"},{"location":"#omniread.PDFParser-attributes","title":"Attributes","text":""},{"location":"#omniread.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"#omniread.PDFParser-functions","title":"Functions","text":""},{"location":"#omniread.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

T: Parsed representation of type T.

Raises:

Exception: Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and\n  return a deterministic, structured output.\n
"},{"location":"#omniread.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

bool: True if the content type is supported; False otherwise.

"},{"location":"#omniread.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output\n  into `Content`.\n- Preserves caller-provided metadata.\n

Constraints:

- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n

Initialize the PDF scraper.

Parameters:

client (BasePDFClient, required): PDF client responsible for retrieving raw PDF bytes."},{"location":"#omniread.PDFScraper-functions","title":"Functions","text":""},{"location":"#omniread.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

source (Any, required): Identifier of the PDF source as understood by the configured PDF client.

metadata (Optional[Mapping[str, Any]], default None): Optional metadata to attach to the returned content.

Returns:

Content: A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Exception: Retrieval-specific errors raised by the PDF client.

"},{"location":"core/","title":"Core","text":""},{"location":"core/#omniread.core","title":"omniread.core","text":""},{"location":"core/#omniread.core--summary","title":"Summary","text":"

Core domain contracts for OmniRead.

This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).

Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.

Submodules:

Format-specific behavior must not be introduced at this layer.

"},{"location":"core/#omniread.core--public-api","title":"Public API","text":""},{"location":"core/#omniread.core-classes","title":"Classes","text":""},{"location":"core/#omniread.core.BaseParser","title":"BaseParser","text":"
BaseParser(content: Content)\n

Bases: ABC, Generic[T]

Base interface for all parsers.

Notes

Guarantees:

- A parser is a self-contained object that owns the `Content` it is\n  responsible for interpreting.\n- Consumers may rely on early validation of content compatibility\n  and type-stable return values from `parse()`.\n

Responsibilities:

- Implementations must declare supported content types via `supported_types`.\n- Implementations must raise parsing-specific exceptions from `parse()`.\n- Implementations must remain deterministic for a given input.\n

Initialize the parser with content to be parsed.

Parameters:

content (Content, required): Content instance to be parsed.

Raises:

ValueError: If the content type is not supported by this parser.
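
The early-validation guarantee can be pictured with a small stand-in base class. Everything below is a sketch under stated assumptions: `SketchParser`, `ByteCountParser`, `AnyParser`, and `can_build` are hypothetical names, the constructor takes a content type and raw bytes instead of a `Content` object, and a missing content type is treated as unsupported whenever `supported_types` is non-empty.

```python
from abc import ABC, abstractmethod
from typing import Generic, Optional, Set, TypeVar

T = TypeVar('T')


class SketchParser(ABC, Generic[T]):
    # An empty set means the parser is content-type agnostic.
    supported_types: Set[str] = set()

    def __init__(self, content_type: Optional[str], raw: bytes) -> None:
        self.content_type = content_type
        self.raw = raw
        # Validate compatibility at construction time, not at parse time.
        if not self.supports():
            raise ValueError(f'Unsupported content type: {content_type}')

    def supports(self) -> bool:
        if not self.supported_types:
            return True
        return self.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class ByteCountParser(SketchParser[int]):
    supported_types = {'application/pdf'}

    def parse(self) -> int:
        return len(self.raw)


class AnyParser(SketchParser[bytes]):
    # Inherits the empty supported_types set, so it accepts anything.
    def parse(self) -> bytes:
        return self.raw


def can_build(parser_cls, content_type: Optional[str], raw: bytes = b'') -> bool:
    try:
        parser_cls(content_type, raw)
    except ValueError:
        return False
    return True
```

Failing in the constructor means a mismatched parser never reaches `parse()`, so callers can handle routing errors separately from parsing errors.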

"},{"location":"core/#omniread.core.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: Set[ContentType] = set()\n

Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.

"},{"location":"core/#omniread.core.BaseParser-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse the owned content into structured output.

Returns:

T: Parsed, structured representation.

Raises:

Exception: Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully consume the provided content and\n  return a deterministic, structured output.\n
"},{"location":"core/#omniread.core.BaseParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

bool: True if the content type is supported; False otherwise.

"},{"location":"core/#omniread.core.BaseScraper","title":"BaseScraper","text":"

Bases: ABC

Base interface for all scrapers.

Notes

Responsibilities:

- A scraper is responsible ONLY for fetching raw content (bytes)\n  from a source. It must not interpret or parse it.\n- A scraper is a stateless acquisition component that retrieves raw\n  content from a source and returns it as a `Content` object.\n- Scrapers define how content is obtained, not what the content means.\n- Implementations may vary in transport mechanism, authentication\n  strategy, retry and backoff behavior.\n

Constraints:

- Implementations must not parse content, modify content semantics,\n  or couple scraping logic to a specific parser.\n
"},{"location":"core/#omniread.core.BaseScraper-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseScraper.fetch","title":"fetch abstractmethod","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch raw content from the given source.

Parameters:

source (str, required): Location identifier (URL, file path, S3 URI, etc.).

metadata (Optional[Mapping[str, Any]], default None): Optional hints for the scraper (headers, auth, etc.).

Returns:

Content: Content object containing raw bytes and metadata.

Raises:

Exception: Retrieval-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must retrieve the content referenced by `source`\n  and return it as raw bytes wrapped in a `Content` object.\n
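
To make the fetch contract concrete, here is a minimal sketch of a scraper that reads from an in-memory dictionary. The classes `SketchScraper` and `InMemoryScraper` and the local `Content` stand-in are assumptions for illustration; none of them is part of OmniRead, and raising `KeyError` for an unknown source is just one possible choice of retrieval error.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


@dataclass
class Content:
    # Local stand-in with the documented fields; content_type is a plain
    # string here to keep the sketch short.
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


class SketchScraper(ABC):
    @abstractmethod
    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        ...


class InMemoryScraper(SketchScraper):
    # Hypothetical scraper: returns stored bytes untouched, never parses them.
    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        if source not in self._store:
            raise KeyError(source)
        return Content(
            raw=self._store[source],
            source=source,
            metadata=dict(metadata or {}),
        )


scraper = InMemoryScraper({'memo': b'raw bytes'})
content = scraper.fetch('memo', metadata={'requested_by': 'demo'})
```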
"},{"location":"core/#omniread.core.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with\n  minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n  parsers, and downstream consumers.\n
"},{"location":"core/#omniread.core.Content-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"core/#omniread.core.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"core/#omniread.core.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"core/#omniread.core.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"core/#omniread.core.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the\n  content source.\n- It is primarily used for routing content to the appropriate\n  parser or downstream consumer.\n
"},{"location":"core/#omniread.core.ContentType-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"core/#omniread.core.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"core/#omniread.core.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"core/#omniread.core.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"core/content/","title":"Content","text":""},{"location":"core/content/#omniread.core.content","title":"omniread.core.content","text":""},{"location":"core/content/#omniread.core.content--summary","title":"Summary","text":"

Canonical content models for OmniRead.

This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.

The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.

"},{"location":"core/content/#omniread.core.content-classes","title":"Classes","text":""},{"location":"core/content/#omniread.core.content.Content","title":"Content dataclass","text":"
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n

Normalized representation of extracted content.

Notes

Responsibilities:

- A `Content` instance represents a raw content payload along with\n  minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n  parsers, and downstream consumers.\n
"},{"location":"core/content/#omniread.core.content.Content-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.Content.content_type","title":"content_type class-attribute instance-attribute","text":"
content_type: Optional[ContentType] = None\n

Optional MIME type of the content, if known.

"},{"location":"core/content/#omniread.core.content.Content.metadata","title":"metadata class-attribute instance-attribute","text":"
metadata: Optional[Mapping[str, Any]] = None\n

Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).

"},{"location":"core/content/#omniread.core.content.Content.raw","title":"raw instance-attribute","text":"
raw: bytes\n

Raw content bytes as retrieved from the source.

"},{"location":"core/content/#omniread.core.content.Content.source","title":"source instance-attribute","text":"
source: str\n

Identifier of the content origin (URL, file path, or logical name).

"},{"location":"core/content/#omniread.core.content.ContentType","title":"ContentType","text":"

Bases: str, Enum

Supported MIME types for extracted content.

Notes

Guarantees:

- This enum represents the declared or inferred media type of the\n  content source.\n- It is primarily used for routing content to the appropriate\n  parser or downstream consumer.\n
"},{"location":"core/content/#omniread.core.content.ContentType-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"
HTML = 'text/html'\n

HTML document content.

"},{"location":"core/content/#omniread.core.content.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'application/json'\n

JSON document content.

"},{"location":"core/content/#omniread.core.content.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"
PDF = 'application/pdf'\n

PDF document content.

"},{"location":"core/content/#omniread.core.content.ContentType.XML","title":"XML class-attribute instance-attribute","text":"
XML = 'application/xml'\n

XML document content.

"},{"location":"core/parser/","title":"Parser","text":""},{"location":"core/parser/#omniread.core.parser","title":"omniread.core.parser","text":""},{"location":"core/parser/#omniread.core.parser--summary","title":"Summary","text":"

Abstract parsing contracts for OmniRead.

This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.

Parsers are responsible for:

Parsers are not responsible for:

"},{"location":"core/parser/#omniread.core.parser-classes","title":"Classes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"
BaseParser(content: Content)\n

Bases: ABC, Generic[T]

Base interface for all parsers.

Notes

Guarantees:

- A parser is a self-contained object that owns the `Content` it is\n  responsible for interpreting.\n- Consumers may rely on early validation of content compatibility\n  and type-stable return values from `parse()`.\n

Responsibilities:

- Implementations must declare supported content types via `supported_types`.\n- Implementations must raise parsing-specific exceptions from `parse()`.\n- Implementations must remain deterministic for a given input.\n

Initialize the parser with content to be parsed.

Parameters:

content (Content, required): Content instance to be parsed.

Raises:

ValueError: If the content type is not supported by this parser.

"},{"location":"core/parser/#omniread.core.parser.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: Set[ContentType] = set()\n

Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.

"},{"location":"core/parser/#omniread.core.parser.BaseParser-functions","title":"Functions","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse the owned content into structured output.

Returns:

T: Parsed, structured representation.

Raises:

Exception: Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully consume the provided content and\n  return a deterministic, structured output.\n
"},{"location":"core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

bool: True if the content type is supported; False otherwise.

"},{"location":"core/scraper/","title":"Scraper","text":""},{"location":"core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":""},{"location":"core/scraper/#omniread.core.scraper--summary","title":"Summary","text":"

Abstract scraping contracts for OmniRead.

This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.

Scrapers are responsible for:

Scrapers are explicitly NOT responsible for:

All interpretation must be delegated to parsers.

"},{"location":"core/scraper/#omniread.core.scraper-classes","title":"Classes","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":"

Bases: ABC

Base interface for all scrapers.

Notes

Responsibilities:

- A scraper is responsible ONLY for fetching raw content (bytes)\n  from a source. It must not interpret or parse it.\n- A scraper is a stateless acquisition component that retrieves raw\n  content from a source and returns it as a `Content` object.\n- Scrapers define how content is obtained, not what the content means.\n- Implementations may vary in transport mechanism, authentication\n  strategy, retry and backoff behavior.\n

Constraints:

- Implementations must not parse content, modify content semantics,\n  or couple scraping logic to a specific parser.\n
"},{"location":"core/scraper/#omniread.core.scraper.BaseScraper-functions","title":"Functions","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch abstractmethod","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch raw content from the given source.

Parameters:

source (str, required): Location identifier (URL, file path, S3 URI, etc.).

metadata (Optional[Mapping[str, Any]], default None): Optional hints for the scraper (headers, auth, etc.).

Returns:

Content: Content object containing raw bytes and metadata.

Raises:

Exception: Retrieval-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must retrieve the content referenced by `source`\n  and return it as raw bytes wrapped in a `Content` object.\n
"},{"location":"html/","title":"Html","text":""},{"location":"html/#omniread.html","title":"omniread.html","text":""},{"location":"html/#omniread.html--summary","title":"Summary","text":"

HTML format implementation for OmniRead.

This package provides HTML-specific implementations of the core OmniRead contracts defined in omniread.core.

It includes:

Key characteristics:

Consumers should depend on omniread.core interfaces wherever possible and use this package only when HTML-specific behavior is required.

"},{"location":"html/#omniread.html--public-api","title":"Public API","text":""},{"location":"html/#omniread.html-classes","title":"Classes","text":""},{"location":"html/#omniread.html.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- This class extends the core `BaseParser` with HTML-specific behavior,\n  including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Provides reusable helpers for HTML extraction. Concrete parsers must\n  explicitly define the return type.\n

Guarantees:

- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n

Constraints:

- Concrete subclasses must define the output type `T` and implement\n  the `parse()` method.\n

Initialize the HTML parser.

Parameters:

content (Content, required): HTML content to be parsed.

features (str, default 'html.parser'): BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

Raises:

ValueError: If the content is empty or not valid HTML.

"},{"location":"html/#omniread.html.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/#omniread.html.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"html/#omniread.html.HTMLParser-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a\n  deterministic, structured output.\n
"},{"location":"html/#omniread.html.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Name Type Description str str

Flattened, whitespace-normalized text content.

"},{"location":"html/#omniread.html.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

The value of the href attribute, or None if absent.

"},{"location":"html/#omniread.html.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

Type Description dict[str, Any]

Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document.\n- This includes the document title and `<meta>` tag name/property to\n  content mappings.\n
"},{"location":"html/#omniread.html.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

A list of rows, where each row is a list of cell text values.

"},{"location":"html/#omniread.html.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- This scraper retrieves HTML documents over HTTP(S) and returns\n  them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n  HTML content type, and preserves HTTP response metadata.\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or\n  handle non-HTML responses.\n

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Client | None

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"html/#omniread.html.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

"},{"location":"html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.

"},{"location":"html/parser/","title":"Parser","text":""},{"location":"html/parser/#omniread.html.parser","title":"omniread.html.parser","text":""},{"location":"html/parser/#omniread.html.parser--summary","title":"Summary","text":"

HTML parser base implementations for OmniRead.

This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.

It supplies:

Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.

"},{"location":"html/parser/#omniread.html.parser-classes","title":"Classes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser","title":"HTMLParser","text":"
HTMLParser(content: Content, features: str = 'html.parser')\n

Bases: BaseParser[T], Generic[T]

Base HTML parser.

Notes

Responsibilities:

- This class extends the core `BaseParser` with HTML-specific behavior,\n  including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Concrete parsers must explicitly define the return type.\n

Guarantees:

- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n

Constraints:

- Concrete subclasses must define the output type `T` and implement\n  the `parse()` method.\n

Initialize the HTML parser.

Parameters:

Name Type Description Default content Content

HTML content to be parsed.

required features str

BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

'html.parser'

Raises:

Type Description ValueError

If the content is empty or not valid HTML.

"},{"location":"html/parser/#omniread.html.parser.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {HTML}\n

Set of content types supported by this parser (HTML only).

"},{"location":"html/parser/#omniread.html.parser.HTMLParser-functions","title":"Functions","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Fully parse the HTML content into structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Notes

Responsibilities:

- Implementations must fully interpret the HTML DOM and return a\n  deterministic, structured output.\n
"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div staticmethod","text":"
parse_div(div: Tag, *, separator: str = ' ') -> str\n

Extract normalized text from a <div> element.

Parameters:

Name Type Description Default div Tag

BeautifulSoup tag representing a <div>.

required separator str

String used to separate text nodes.

' '

Returns:

Name Type Description str str

Flattened, whitespace-normalized text content.
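
The following is a minimal sketch of what flattened, whitespace-normalized output could look like. It is not the library's implementation: it works on plain text fragments rather than a BeautifulSoup Tag, and the name normalize_text is hypothetical.

```python
from typing import Iterable


def normalize_text(fragments: Iterable[str], separator: str = ' ') -> str:
    # Collapse runs of whitespace inside each fragment.
    cleaned = (' '.join(fragment.split()) for fragment in fragments)
    # Skip fragments that are empty after cleaning, then join the rest.
    return separator.join(part for part in cleaned if part)


print(normalize_text(['  Hello ', '', 'wide   world  ']))  # Hello wide world
```

Under this reading, the separator only appears between non-empty text nodes, so stray whitespace in the markup does not leak into the result.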

"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link staticmethod","text":"
parse_link(a: Tag) -> Optional[str]\n

Extract the hyperlink reference from an <a> element.

Parameters:

Name Type Description Default a Tag

BeautifulSoup tag representing an anchor.

required

Returns:

Type Description Optional[str]

The value of the href attribute, or None if absent.

"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_meta","title":"parse_meta","text":"
parse_meta() -> dict[str, Any]\n

Extract high-level metadata from the HTML document.

Returns:

Type Description dict[str, Any]

Dictionary containing extracted metadata.

Notes

Responsibilities:

- Extract high-level metadata from the HTML document.\n- This includes the document title and `<meta>` tag name/property to\n  content mappings.\n
"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table staticmethod","text":"
parse_table(table: Tag) -> list[list[str]]\n

Parse an HTML table into a 2D list of strings.

Parameters:

Name Type Description Default table Tag

BeautifulSoup tag representing a <table>.

required

Returns:

Type Description list[list[str]]

A list of rows, where each row is a list of cell text values.
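
To illustrate the shape of the returned value, here is a rough sketch that builds the same kind of 2D list using only the standard library. It is an assumption-laden stand-in: OmniRead operates on a BeautifulSoup Tag, whereas this sketch takes a markup string, and the names TableRows and table_to_rows are hypothetical.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class TableRows(StdlibHTMLParser):
    # Collects one list per <tr>, with one string per <td> or <th>.
    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th') and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Normalize whitespace before storing the cell text.
            self.rows[-1].append(' '.join(''.join(self._cell).split()))
            self._cell = None


def table_to_rows(markup: str) -> list[list[str]]:
    collector = TableRows()
    collector.feed(markup)
    collector.close()
    return collector.rows
```

Header cells and data cells are treated alike here, so a header row simply becomes the first inner list. Whether the library does the same is not stated in this reference.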

"},{"location":"html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"html/scraper/","title":"Scraper","text":""},{"location":"html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":""},{"location":"html/scraper/#omniread.html.scraper--summary","title":"Summary","text":"

HTML scraping implementation for OmniRead.

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for:

This scraper is not responsible for:

"},{"location":"html/scraper/#omniread.html.scraper-classes","title":"Classes","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n

Bases: BaseScraper

Base HTML scraper using httpx.

Notes

Responsibilities:

- This scraper retrieves HTML documents over HTTP(S) and returns\n  them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n  HTML content type, and preserves HTTP response metadata.\n

Constraints:

- The scraper does not parse HTML, perform retries or backoff, or\n  handle non-HTML responses.\n

Initialize the HTML scraper.

Parameters:

Name Type Description Default client Client | None

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None timeout float

Request timeout in seconds.

15.0 headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None follow_redirects bool

Whether to follow HTTP redirects.

True"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default source str

URL of the HTML document.

required metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.

Raises:

Type Description HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"
validate_content_type(response: httpx.Response) -> None\n

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default response Response

HTTP response returned by httpx.

required

Raises:

Type Description ValueError

If the Content-Type header is missing or does not indicate HTML content.
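
The check described above might look roughly like the sketch below. It is not the library's code: it takes a plain header mapping instead of an httpx.Response, the function name check_html_content_type is hypothetical, and the set of media types treated as HTML is an assumption.

```python
from typing import Mapping

# Assumption: these media types count as HTML for the purpose of the sketch.
HTML_MEDIA_TYPES = {'text/html', 'application/xhtml+xml'}


def check_html_content_type(headers: Mapping[str, str]) -> None:
    # Header names are case-insensitive, so compare on lowercase keys.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get('content-type')
    if not raw:
        raise ValueError('Missing Content-Type header')
    # Drop parameters such as charset before comparing the media type.
    media_type = raw.split(';', 1)[0].strip().lower()
    if media_type not in HTML_MEDIA_TYPES:
        raise ValueError(f'Expected HTML content, got {media_type!r}')
```

Raising ValueError for both the missing and the mismatched case matches the documented behavior, and lets callers handle the two failures with one except clause.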

"},{"location":"pdf/","title":"Pdf","text":""},{"location":"pdf/#omniread.pdf","title":"omniread.pdf","text":""},{"location":"pdf/#omniread.pdf--summary","title":"Summary","text":"

PDF format implementation for OmniRead.

This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.

Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:

Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.

"},{"location":"pdf/#omniread.pdf--public-api","title":"Public API","text":""},{"location":"pdf/#omniread.pdf-classes","title":"Classes","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns\n  their raw binary contents.\n
"},{"location":"pdf/#omniread.pdf.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.

"},{"location":"pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides\n  the extension point for implementing concrete PDF parsing strategies.\n

Constraints:

- Concrete implementations must define the output type `T` and\n  implement the `parse()` method.\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"pdf/#omniread.pdf.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"pdf/#omniread.pdf.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and\n  return a deterministic, structured output.\n
"},{"location":"pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output\n  into `Content`.\n- Preserves caller-provided metadata.\n

Constraints:

- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"pdf/#omniread.pdf.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.

"},{"location":"pdf/client/","title":"Client","text":""},{"location":"pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":""},{"location":"pdf/client/#omniread.pdf.client--summary","title":"Summary","text":"

PDF client abstractions for OmniRead.

This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.

Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.

Typical backing stores include:

"},{"location":"pdf/client/#omniread.pdf.client-classes","title":"Classes","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":"

Bases: ABC

Abstract client responsible for retrieving PDF bytes.

Retrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).

Notes

Responsibilities:

- Implementations must accept a source identifier appropriate to\n  the backing store.\n- Return the full PDF binary payload.\n- Raise retrieval-specific errors on failure.\n
"},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch abstractmethod","text":"
fetch(source: Any) -> bytes\n

Fetch raw PDF bytes from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF location, such as a file path, object storage key, or remote reference.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description Exception

Retrieval-specific errors defined by the implementation.
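
As a sketch of how a custom backing store could satisfy this contract, the example below implements an in-memory client. The BasePDFClient class defined here is a local stand-in that mirrors the documented signature so the snippet runs on its own; InMemoryPDFClient and its use of LookupError are hypothetical choices, not part of OmniRead.

```python
from abc import ABC, abstractmethod
from typing import Any


class BasePDFClient(ABC):
    # Local stand-in for the documented contract. In real code, import
    # BasePDFClient from omniread.pdf.client instead of redefining it.
    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        ...


class InMemoryPDFClient(BasePDFClient):
    # Hypothetical client that serves PDF bytes from a dictionary.
    def __init__(self, documents: dict[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(self, source: Any) -> bytes:
        try:
            return self._documents[source]
        except KeyError:
            # Retrieval-specific error chosen for this sketch.
            raise LookupError(f'No PDF stored under key {source!r}') from None
```

A client like this can be handy in tests, since a scraper that accepts any BasePDFClient does not need to touch the filesystem or the network.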

"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":"

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from the disk and returns\n  their raw binary contents.\n
"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"
fetch(path: Path) -> bytes\n

Read a PDF file from the local filesystem.

Parameters:

Name Type Description Default path Path

Filesystem path to the PDF file.

required

Returns:

Name Type Description bytes bytes

Raw PDF bytes.

Raises:

Type Description FileNotFoundError

If the path does not exist.

ValueError

If the path exists but is not a file.
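
The documented error behavior can be sketched with pathlib alone. This is an illustration under assumptions, not the library's source: the function name read_pdf_bytes is hypothetical, and the error messages are invented for the example.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    # A missing path maps to FileNotFoundError, as documented.
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    # An existing non-file (for example a directory) maps to ValueError.
    if not path.is_file():
        raise ValueError(f'Path is not a file: {path}')
    # No validation of the bytes is performed; they are returned as-is.
    return path.read_bytes()
```

Checking existence before the file test keeps the two failure modes distinct, so callers can tell a wrong path apart from a path that points at a directory.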

"},{"location":"pdf/parser/","title":"Parser","text":""},{"location":"pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":""},{"location":"pdf/parser/#omniread.pdf.parser--summary","title":"Summary","text":"

PDF parser base implementations for OmniRead.

This module defines the PDF-specific parser contract, extending the format-agnostic BaseParser with constraints appropriate for PDF content.

PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.

"},{"location":"pdf/parser/#omniread.pdf.parser-classes","title":"Classes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"
PDFParser(content: Content)\n

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides\n  the extension point for implementing concrete PDF parsing strategies.\n

Constraints:

- Concrete implementations must define the output type `T` and\n  implement the `parse()` method.\n

Initialize the parser with content to be parsed.

Parameters:

Name Type Description Default content Content

Content instance to be parsed.

required

Raises:

Type Description ValueError

If the content type is not supported by this parser.

"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"
supported_types: set[ContentType] = {PDF}\n

Set of content types supported by this parser (PDF only).

"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse abstractmethod","text":"
parse() -> T\n

Parse PDF content into a structured output.

Returns:

Name Type Description T T

Parsed representation of type T.

Raises:

Type Description Exception

Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and\n  return a deterministic, structured output.\n
"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"
supports() -> bool\n

Check whether this parser supports the content's type.

Returns:

Name Type Description bool bool

True if the content type is supported; False otherwise.

"},{"location":"pdf/scraper/","title":"Scraper","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper--summary","title":"Summary","text":"

PDF scraping implementation for OmniRead.

This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.

The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.

"},{"location":"pdf/scraper/#omniread.pdf.scraper-classes","title":"Classes","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper","title":"PDFScraper","text":"
PDFScraper(*, client: BasePDFClient)\n

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output\n  into `Content`.\n- Preserves caller-provided metadata.\n

Constraints:

- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n

Initialize the PDF scraper.

Parameters:

Name Type Description Default client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required"},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default source Any

Identifier of the PDF source as understood by the configured PDF client.

required metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Name Type Description Content Content

A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type Description Exception

Retrieval-specific errors raised by the PDF client.

"}]}