Scraper

omniread.pdf.scraper

PDF scraping implementation for OmniRead.

This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.

The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.

PDFScraper(*, client: BasePDFClient)

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into Content
- Preserves caller-provided metadata

Constraints:

- Does not perform parsing or interpretation
- Does not assume a specific storage backend
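The division of labour described in these notes can be sketched as follows. This is a minimal illustration and not OmniRead's actual code: `SketchContent`, `SketchPDFClient`, `InMemoryPDFClient`, and `SketchPDFScraper` are hypothetical stand-ins. The client method name `fetch`, the field names on the content object, and the `"application/pdf"` string are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Protocol


@dataclass(frozen=True)
class SketchContent:
    """Stand-in for Content; the field names here are assumptions."""
    raw: bytes
    source: Any
    content_type: str
    metadata: Optional[Mapping[str, Any]] = None


class SketchPDFClient(Protocol):
    """Stand-in for BasePDFClient; the method name `fetch` is an assumption."""

    def fetch(self, source: Any) -> bytes: ...


class InMemoryPDFClient:
    """Hypothetical client that serves PDF bytes from a dictionary."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = documents

    def fetch(self, source: Any) -> bytes:
        return self._documents[source]


class SketchPDFScraper:
    """Holds a client and wraps its bytes without parsing them."""

    def __init__(self, *, client: SketchPDFClient) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> SketchContent:
        raw = self._client.fetch(source)  # retrieval is delegated to the client
        return SketchContent(
            raw=raw,
            source=source,
            content_type="application/pdf",  # assumed representation
            metadata=metadata,  # caller-provided metadata is passed through as-is
        )


scraper = SketchPDFScraper(client=InMemoryPDFClient({"report": b"%PDF-1.7 sketch"}))
content = scraper.fetch("report", metadata={"origin": "unit-test"})
```

Because the scraper only sees the client interface, swapping the dictionary-backed client for one that reads from disk or a remote store would not change the scraper in this sketch.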

Initialize the PDF scraper.

Parameters:

Name	Type	Description	Default
`client`	`BasePDFClient`	PDF client responsible for retrieving raw PDF bytes.	required
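The bare `*` in the constructor signature makes `client` keyword-only, so it cannot be passed positionally. The short sketch below shows that behaviour with a hypothetical stand-in class, `KeywordOnlyScraper`, that mirrors only the signature; it is not the real `PDFScraper`.

```python
class KeywordOnlyScraper:
    """Hypothetical stand-in mirroring the `(*, client)` signature."""

    def __init__(self, *, client: object) -> None:
        self.client = client


stub_client = object()  # placeholder for a real PDF client

# Passing the client by keyword works.
scraper = KeywordOnlyScraper(client=stub_client)

# Passing it positionally is rejected by Python itself.
try:
    KeywordOnlyScraper(stub_client)
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False
```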

fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

Name	Type	Description	Default
`source`	`Any`	Identifier of the PDF source as understood by the configured PDF client.	required
`metadata`	`Optional[Mapping[str, Any]]`	Optional metadata to attach to the returned content.	`None`

Returns:

Name	Type	Description
`Content`	`Content`	A `Content` instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.

Raises:

Type	Description
`Exception`	Retrieval-specific errors raised by the PDF client.
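Since the raised errors originate in the PDF client, callers typically need to handle the client's own exception types around `fetch`. The sketch below assumes the scraper lets client exceptions propagate unchanged, which the table above suggests but does not spell out. `MissingPDFError`, `FailingClient`, `PassThroughScraper`, and `safe_fetch` are hypothetical names introduced for illustration.

```python
class MissingPDFError(Exception):
    """Hypothetical retrieval error raised by a client."""


class FailingClient:
    """Hypothetical client whose lookup always fails."""

    def fetch(self, source):
        raise MissingPDFError(f"no PDF found for {source!r}")


class PassThroughScraper:
    """Stand-in scraper that does not catch client errors (an assumption)."""

    def __init__(self, *, client):
        self._client = client

    def fetch(self, source, *, metadata=None):
        return self._client.fetch(source)  # no try/except: errors propagate


def safe_fetch(scraper, source):
    """Catch the client's error type at the call site and report it."""
    try:
        return scraper.fetch(source)
    except MissingPDFError as exc:
        return f"skipped: {exc}"


result = safe_fetch(PassThroughScraper(client=FailingClient()), "missing.pdf")
```

Catching the specific client exception, rather than a bare `Exception`, keeps unrelated failures visible to the caller.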