Scraper

omniread.pdf.scraper

PDF scraping implementation for OmniRead.


Summary

This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.

The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.

Classes

PDFScraper

PDFScraper(*, client: BasePDFClient)

Bases: BaseScraper

Scraper for PDF sources.

Notes

Responsibilities:

- Delegates byte retrieval to a PDF client and normalizes output into Content
- Preserves caller-provided metadata

Constraints:

- Does not perform parsing or interpretation
- Does not assume a specific storage backend
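
The responsibilities and constraints above can be sketched roughly as follows. This is a minimal illustration, not OmniRead's actual source: the `Content` field names, the client method name `fetch`, the `"application/pdf"` value, and the classes `InMemoryPDFClient` and `SketchPDFScraper` are all assumptions made so the example runs on its own.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Content:
    """Stand-in for OmniRead's Content; the field names are assumptions."""

    raw: bytes
    source: Any
    content_type: str
    metadata: Optional[Mapping[str, Any]] = None


class BasePDFClient(ABC):
    """Stand-in client contract; the method name `fetch` is an assumption."""

    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        ...


class InMemoryPDFClient(BasePDFClient):
    """Hypothetical backend that serves PDF bytes from a dict."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = documents

    def fetch(self, source: Any) -> bytes:
        return self._documents[source]


class SketchPDFScraper:
    """Illustrative scraper: get bytes from the client, wrap them, nothing more."""

    def __init__(self, *, client: BasePDFClient) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        # Storage and access concerns live entirely in the client.
        raw = self._client.fetch(source)
        # No parsing or interpretation: the bytes pass through untouched,
        # and caller-provided metadata is attached as given.
        return Content(
            raw=raw,
            source=source,
            content_type="application/pdf",  # assumed representation
            metadata=metadata,
        )


scraper = SketchPDFScraper(client=InMemoryPDFClient({"report": b"%PDF-1.7 demo"}))
content = scraper.fetch("report", metadata={"owner": "docs-team"})
```

Because the scraper only sees the client interface, swapping the in-memory backend for a filesystem or HTTP client would not change the scraper in this sketch.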

Initialize the PDF scraper.

Parameters:

- client (BasePDFClient, required): PDF client responsible for retrieving raw PDF bytes.
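
The bare `*` in the constructor signature means `client` is keyword-only in Python. The snippet below illustrates that behavior with a hypothetical `DemoScraper` class that only mirrors the constructor shape; it is not the real `PDFScraper`, and `placeholder_client` merely stands in for a `BasePDFClient` implementation.

```python
class DemoScraper:
    """Hypothetical class with the same constructor shape as PDFScraper."""

    def __init__(self, *, client: object) -> None:
        self.client = client


placeholder_client = object()  # stands in for a BasePDFClient implementation

# Accepted: the client is passed by keyword.
scraper = DemoScraper(client=placeholder_client)

# Rejected: a positional argument raises TypeError.
try:
    DemoScraper(placeholder_client)
    positional_accepted = True
except TypeError:
    positional_accepted = False
```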

Functions

fetch

fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

- source (Any, required): Identifier of the PDF source as understood by the configured PDF client.
- metadata (Optional[Mapping[str, Any]], default None): Optional metadata to attach to the returned content.

Returns:

- Content: A Content instance containing the raw PDF bytes, the source identifier, the PDF content type, and any optional metadata.

Raises:

- Exception: Retrieval-specific errors raised by the PDF client.
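
Since the documented exception type is the broad `Exception`, callers may want to catch the narrowest type their chosen client actually raises. The sketch below shows one way this could look, assuming the scraper lets client errors propagate unchanged. `MissingDocumentClient`, `SketchPDFScraper`, and the use of `FileNotFoundError` are hypothetical, and the stand-in scraper returns raw bytes rather than a Content object to keep the example short.

```python
from typing import Any


class MissingDocumentClient:
    """Hypothetical client whose backend cannot find the requested document."""

    def fetch(self, source: Any) -> bytes:
        raise FileNotFoundError(f"no PDF stored under {source!r}")


class SketchPDFScraper:
    """Illustrative scraper that lets client errors propagate unchanged."""

    def __init__(self, *, client: Any) -> None:
        self._client = client

    def fetch(self, source: Any) -> bytes:
        # No try/except here: retrieval errors surface to the caller as raised.
        return self._client.fetch(source)


scraper = SketchPDFScraper(client=MissingDocumentClient())

try:
    scraper.fetch("missing.pdf")
    error_message = None
except FileNotFoundError as exc:
    # Catch the specific error this client raises; use Exception only as a fallback.
    error_message = str(exc)
```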