PDF

omniread.pdf

Summary

PDF format implementation for OmniRead.

This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.

Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:

- PDF clients for acquiring raw PDF data.
- PDF scrapers that coordinate client access.
- PDF parsers that extract structured content from PDF binaries.

Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.

Public API

- FileSystemPDFClient
- PDFScraper
- PDFParser

Classes

FileSystemPDFClient

Bases: BasePDFClient

PDF client that reads from the local filesystem.

Notes

Guarantees:

- This client reads PDF files directly from disk and returns
  their raw binary contents, without decoding or interpreting them.

Functions

fetch

fetch(path: Path) -> bytes

Read a PDF file from the local filesystem.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Filesystem path to the PDF file.	required

Returns:

Name	Type	Description
`bytes`	`bytes`	Raw PDF bytes.

Raises:

Type	Description
`FileNotFoundError`	If the path does not exist.
`ValueError`	If the path exists but is not a file.
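
The error contract above can be illustrated with a small standalone function. This is only a sketch of the documented behavior, not the library's implementation: `read_pdf_bytes` is a hypothetical name, and the order of the checks is an assumption.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    """Hypothetical stand-in that mirrors the documented fetch() contract."""
    # A missing path maps to FileNotFoundError, as in the Raises table.
    if not path.exists():
        raise FileNotFoundError(f"PDF path does not exist: {path}")
    # A path that exists but is a directory (or other non-file) maps to ValueError.
    if not path.is_file():
        raise ValueError(f"PDF path is not a file: {path}")
    # The bytes are returned untouched; no parsing happens at this layer.
    return path.read_bytes()
```

A caller would typically catch `FileNotFoundError` and `ValueError` separately, since the first usually indicates a wrong location and the second a wrong kind of filesystem entry.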

PDFParser

PDFParser(content: Content)

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides
  the extension point for implementing concrete PDF parsing strategies.

Constraints:

- Concrete implementations must define the output type `T` and
  implement the `parse()` method.

Initialize the parser with content to be parsed.

Parameters:

Name	Type	Description	Default
`content`	`Content`	Content instance to be parsed.	required

Raises:

Type	Description
`ValueError`	If the content type is not supported by this parser.

Attributes

supported_types `class-attribute` `instance-attribute`

supported_types: set[ContentType] = {PDF}

Set of content types supported by this parser (PDF only).

Functions

parse `abstractmethod`

parse() -> T

Parse PDF content into a structured output.

Returns:

Name	Type	Description
`T`	`T`	Parsed representation of type `T`.

Raises:

Type	Description
`Exception`	Parsing-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and
  return a deterministic, structured output.

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

Name	Type	Description
`bool`	`bool`	True if the content type is supported; False otherwise.
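
To make the parser contract concrete, the sketch below shows one way a subclass might fix the output type `T` and implement `parse()`. It does not import OmniRead: `ContentType`, `Content`, and `SketchPDFParser` are simplified stand-ins written for this example, and their field names (`raw`, `content_type`) are assumptions. The concrete parser reads only the `%PDF-x.y` file header, which is far less than a real implementation would interpret.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):  # stand-in, not the OmniRead enum
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass(frozen=True)
class Content:  # stand-in; field names are assumptions
    raw: bytes
    content_type: ContentType


class SketchPDFParser(ABC, Generic[T]):
    """Stand-in that imitates the documented PDFParser behavior."""

    supported_types: set[ContentType] = {ContentType.PDF}

    def __init__(self, content: Content) -> None:
        self.content = content
        # Reject non-PDF content at construction time.
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T: ...


class PDFVersionParser(SketchPDFParser[str]):
    """Concrete strategy: T is str, and parse() returns the header version."""

    def parse(self) -> str:
        header = self.content.raw[:8]
        if not header.startswith(b"%PDF-"):
            raise ValueError("Missing %PDF- header")
        # Bytes 5..7 hold the version digits, e.g. b"1.7".
        return header[5:8].decode("ascii")
```

With these stand-ins, `PDFVersionParser(Content(raw=b"%PDF-1.7\n", content_type=ContentType.PDF)).parse()` returns `"1.7"`, while constructing the parser with HTML content raises `ValueError` before `parse()` is ever called.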

PDFScraper

PDFScraper(*, client: BasePDFClient)

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output
  into `Content`.
- Preserves caller-provided metadata.

Constraints:

- Does not perform parsing or interpretation.
- Does not assume a specific storage backend.

Initialize the PDF scraper.

Parameters:

Name	Type	Description	Default
`client`	`BasePDFClient`	PDF client responsible for retrieving raw PDF bytes.	required

Functions

fetch

fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

Name	Type	Description	Default
`source`	`Any`	Identifier of the PDF source as understood by the configured PDF client.	required
`metadata`	`Optional[Mapping[str, Any]]`	Optional metadata to attach to the returned content.	`None`

Returns:

Name	Type	Description
`Content`	`Content`	A `Content` instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type	Description
`Exception`	Retrieval-specific errors raised by the PDF client.
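
The delegation described above can be sketched without a filesystem at all, which also illustrates why the scraper is backend-agnostic. Everything in this block is a stand-in written for illustration: `Content` and its field names, the `PDFClient` protocol, `InMemoryPDFClient`, and `SketchPDFScraper` are assumptions, not OmniRead classes.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Protocol


@dataclass(frozen=True)
class Content:  # stand-in; field names are assumptions
    raw: bytes
    source: str
    content_type: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class PDFClient(Protocol):
    """Anything with a fetch(source) -> bytes method can act as a client."""

    def fetch(self, source: Any) -> bytes: ...


class InMemoryPDFClient:
    """Hypothetical client backed by a dict instead of the filesystem."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(self, source: Any) -> bytes:
        return self._documents[source]  # KeyError if the source is unknown


class SketchPDFScraper:
    """Stand-in that imitates the documented PDFScraper.fetch() flow."""

    def __init__(self, *, client: PDFClient) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        # Retrieval is delegated; client errors propagate to the caller.
        raw = self._client.fetch(source)
        # The scraper only wraps the bytes; it never inspects them.
        return Content(
            raw=raw,
            source=str(source),
            content_type="application/pdf",
            metadata=dict(metadata or {}),
        )
```

Swapping `InMemoryPDFClient` for a filesystem or network client would leave the scraper unchanged, because it depends only on the client's `fetch` method.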
