Pdf

omniread.pdf

Summary

PDF format implementation for OmniRead.

This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.

Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:

  • PDF clients for acquiring raw PDF data.
  • PDF scrapers that coordinate client access.
  • PDF parsers that extract structured content from PDF binaries.

Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.


Public API

  • FileSystemPDFClient
  • PDFScraper
  • PDFParser

Classes

FileSystemPDFClient

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from disk and returns
  their raw binary contents.
Functions
fetch
fetch(path: Path) -> bytes

Read a PDF file from the local filesystem.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | Filesystem path to the PDF file. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bytes | bytes | Raw PDF bytes. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the path does not exist. |
| ValueError | If the path exists but is not a file. |
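
The contract documented for `fetch` can be sketched with the standard library alone. The function below is a hypothetical stand-in, not OmniRead's implementation: the name `read_pdf_bytes` is an assumption, and only the behaviour (raw bytes returned, `FileNotFoundError` for a missing path, `ValueError` for a path that is not a file) is taken from the reference above.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    """Hypothetical stand-in mirroring the documented fetch() contract."""
    if not path.exists():
        # Documented behaviour: a missing path raises FileNotFoundError.
        raise FileNotFoundError(f"PDF not found: {path}")
    if not path.is_file():
        # Documented behaviour: an existing non-file (e.g. a directory) raises ValueError.
        raise ValueError(f"Path is not a file: {path}")
    # Read in binary mode so the bytes are returned unmodified.
    return path.read_bytes()
```

Checking for existence before checking the file type keeps the two error cases distinct, which matches the two entries in the Raises table.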

PDFParser

PDFParser(content: Content)

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides
  the extension point for implementing concrete PDF parsing strategies.

Constraints:

- Concrete implementations must define the output type `T` and
  implement the `parse()` method.

Initialize the parser with content to be parsed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| content | Content | Content instance to be parsed. | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the content type is not supported by this parser. |

Attributes
supported_types class-attribute instance-attribute
supported_types: set[ContentType] = {PDF}

Set of content types supported by this parser (PDF only).

Functions
parse abstractmethod
parse() -> T

Parse PDF content into a structured output.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| T | T | Parsed representation of type T. |

Raises:

| Type | Description |
| --- | --- |
| Exception | Parsing-specific errors as defined by the implementation. |

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and
  return a deterministic, structured output.
supports
supports() -> bool

Check whether this parser supports the content's type.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the content type is supported; False otherwise. |
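
To illustrate how the pieces of `PDFParser` fit together (the type check in the constructor, `supports()`, and the abstract `parse()`), here is a self-contained sketch. Every class in it is a stand-in written for this example: `ContentType`, `Content` and its field names, `SketchPDFParser`, and `HeaderVersionParser` are assumptions, not OmniRead's real definitions. Only the behaviour follows the reference above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    # Stand-in enum; member values are assumptions for this example.
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    # Stand-in container; the field names are assumptions.
    raw: bytes
    content_type: ContentType


class SketchPDFParser(ABC, Generic[T]):
    """Stand-in mirroring the documented PDFParser behaviour."""

    supported_types = {ContentType.PDF}

    def __init__(self, content: Content) -> None:
        self.content = content
        # Documented behaviour: unsupported content types raise ValueError.
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        """Concrete parsers define T and implement this method."""


class HeaderVersionParser(SketchPDFParser[str]):
    """Toy parser: returns the version string from the '%PDF-x.y' header."""

    def parse(self) -> str:
        header = self.content.raw[:8].decode("ascii", errors="replace")
        if not header.startswith("%PDF-"):
            raise ValueError("Missing PDF header")
        return header[5:8]
```

The toy parser fixes `T` to `str` and reads only the file header, so it shows the extension point rather than a full interpretation of the PDF payload. A real implementation would typically delegate to a PDF library.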

PDFScraper

PDFScraper(*, client: BasePDFClient)

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output
  into `Content`.
- Preserves caller-provided metadata.

Constraints:

- Does not perform parsing or interpretation.
- Does not assume a specific storage backend.

Initialize the PDF scraper.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| client | BasePDFClient | PDF client responsible for retrieving raw PDF bytes. | required |

Functions
fetch
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| source | Any | Identifier of the PDF source as understood by the configured PDF client. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to attach to the returned content. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Content | Content | A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata. |

Raises:

| Type | Description |
| --- | --- |
| Exception | Retrieval-specific errors raised by the PDF client. |
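
The delegation described for `PDFScraper` can be sketched as follows. This is a minimal illustration under stated assumptions, not OmniRead's code: `PDFClient`, `Content` and its field names, `SketchPDFScraper`, and `InMemoryClient` are all hypothetical stand-ins. The sketch shows the scraper asking its client for bytes, wrapping them in a content object together with the source and caller-provided metadata, and letting client errors propagate.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional, Protocol


class PDFClient(Protocol):
    # Stand-in for BasePDFClient: anything with a compatible fetch() works.
    def fetch(self, source: Any) -> bytes: ...


@dataclass
class Content:
    # Stand-in container; the field names are assumptions.
    raw: bytes
    source: str
    content_type: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchPDFScraper:
    """Stand-in mirroring the documented PDFScraper behaviour."""

    def __init__(self, *, client: PDFClient) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        # Retrieval is delegated; client errors propagate to the caller.
        raw = self._client.fetch(source)
        return Content(
            raw=raw,
            source=str(source),
            content_type="application/pdf",
            metadata=dict(metadata or {}),  # copy so the caller's mapping is untouched
        )


class InMemoryClient:
    """Hypothetical client backed by a dict instead of the filesystem."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self._blobs = blobs

    def fetch(self, source: Any) -> bytes:
        return self._blobs[source]  # raises KeyError for unknown sources
```

Because the scraper only depends on the client's `fetch` method, swapping the in-memory client for a filesystem or network client requires no change to the scraper, which is the point of the "does not assume a specific storage backend" constraint.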