omniread.pdf
Summary
PDF format implementation for OmniRead.
This package provides PDF-specific implementations of the core OmniRead
contracts defined in omniread.core.
Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:
- PDF clients for acquiring raw PDF data.
- PDF scrapers that coordinate client access.
- PDF parsers that extract structured content from PDF binaries.
Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.
Public API
- FileSystemPDFClient
- PDFScraper
- PDFParser
Classes
FileSystemPDFClient
Bases: BasePDFClient
PDF client that reads from the local filesystem.
Notes
Guarantees:
Functions
fetch
Read a PDF file from the local filesystem.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Filesystem path to the PDF file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `bytes` | `bytes` | Raw PDF bytes. |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the path does not exist. |
| `ValueError` | If the path exists but is not a file. |
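The contract documented for `fetch` can be illustrated with a minimal sketch. This is not the package's actual implementation: `read_pdf_bytes` is a hypothetical standalone function that mirrors only the documented behavior (return the raw bytes, raise `FileNotFoundError` for a missing path, raise `ValueError` for a path that is not a file).

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    """Hypothetical stand-in that mirrors the documented fetch contract."""
    path = Path(path)
    if not path.exists():
        # Documented behavior: a missing path raises FileNotFoundError.
        raise FileNotFoundError(f"PDF not found: {path}")
    if not path.is_file():
        # Documented behavior: an existing non-file (e.g. a directory) raises ValueError.
        raise ValueError(f"Path is not a file: {path}")
    # No validation of the PDF structure is attempted here; the bytes are returned as-is.
    return path.read_bytes()
```

Checking existence before checking the file type keeps the two documented error cases distinct, so a caller can tell "nothing there" apart from "something there, but not a file".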
PDFParser
Bases: BaseParser[T], Generic[T]
Base PDF parser.
Notes
Responsibilities:
Constraints:
Initialize the parser with content to be parsed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `content` | `Content` | Content instance to be parsed. | required |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the content type is not supported by this parser. |
Attributes
supported_types
class-attribute
instance-attribute
Set of content types supported by this parser (PDF only).
Functions
parse
abstractmethod
Parse PDF content into a structured output.
Returns:
| Name | Type | Description |
|---|---|---|
| `T` | `T` | Parsed representation of type |
Raises:
| Type | Description |
|---|---|
| `Exception` | Parsing-specific errors as defined by the implementation. |
Notes
Responsibilities:
supports
Check whether this parser supports the content's type.
Returns:
| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if the content type is supported; False otherwise. |
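The parser contract described in this section (a type check at construction, a `supports` query, and an abstract `parse`) can be sketched as follows. Everything here is an assumption made for illustration: `ContentType`, `Content`, `SketchPDFParser`, and `ByteCountParser` are hypothetical stand-ins, and the real OmniRead classes may differ in fields and signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    """Hypothetical content-type enum, defined here only for the sketch."""
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    """Hypothetical stand-in for the Content container."""
    raw: bytes
    content_type: ContentType


class SketchPDFParser(ABC, Generic[T]):
    """Illustrative base class mirroring the documented PDFParser behavior."""

    # Class-level set of accepted content types (PDF only).
    supported_types = {ContentType.PDF}

    def __init__(self, content: Content) -> None:
        self.content = content
        # Reject unsupported content at construction time with ValueError.
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        """Return True if the content's type is in supported_types."""
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        """Subclasses turn the PDF bytes into a value of type T."""


class ByteCountParser(SketchPDFParser[int]):
    """Toy subclass that 'parses' a PDF into its size in bytes."""

    def parse(self) -> int:
        return len(self.content.raw)
```

A concrete parser only has to implement `parse`; the type check is inherited, so `ByteCountParser(Content(b"<html>", ContentType.HTML))` fails with `ValueError` before any parsing is attempted.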
PDFScraper
Bases: BaseScraper
Scraper for PDF sources.
Notes
Responsibilities:
Constraints:
Initialize the PDF scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `client` | `BasePDFClient` | PDF client responsible for retrieving raw PDF bytes. | required |
Functions
fetch
Fetch a PDF document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Any` | Identifier of the PDF source as understood by the configured PDF client. | required |
| `metadata` | `Optional[Mapping[str, Any]]` | Optional metadata to attach to the returned content. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Content` | `Content` | A |
Raises:
| Type | Description |
|---|---|
| `Exception` | Retrieval-specific errors raised by the PDF client. |
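The delegation described for the scraper (hand the source to the client, wrap the returned bytes, attach optional metadata, let client errors propagate) might look roughly like the sketch below. All names are hypothetical: `PDFClientLike`, `Content`, `SketchPDFScraper`, and `InMemoryClient` are defined here only for illustration, and the fields of the real `Content` object are not specified in this page.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional, Protocol


class PDFClientLike(Protocol):
    """Hypothetical minimal client interface: source in, raw bytes out."""

    def fetch(self, source: Any) -> bytes: ...


@dataclass
class Content:
    """Hypothetical stand-in for the returned Content object."""
    raw: bytes
    source: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchPDFScraper:
    """Illustrative scraper that delegates retrieval to its client."""

    def __init__(self, client: PDFClientLike) -> None:
        self._client = client

    def fetch(
        self,
        source: Any,
        metadata: Optional[Mapping[str, Any]] = None,
    ) -> Content:
        # Errors raised by the client are deliberately not caught here.
        raw = self._client.fetch(source)
        # Copy the mapping so later changes by the caller do not leak in.
        return Content(raw=raw, source=str(source), metadata=dict(metadata or {}))


class InMemoryClient:
    """Toy client that serves fixed bytes, useful for tests."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(self, source: Any) -> bytes:
        # An unknown source raises KeyError, which the scraper lets through.
        return self._documents[source]


scraper = SketchPDFScraper(InMemoryClient({"report": b"%PDF-1.4"}))
content = scraper.fetch("report", metadata={"origin": "example"})
```

Because the scraper depends only on the client's `fetch` method, swapping the in-memory client for a filesystem-backed one requires no change to the scraper itself.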