Parser

omniread.pdf.parser

Summary

PDF parser base implementations for OmniRead.

This module defines the PDF-specific parser contract, extending the format-agnostic BaseParser with constraints appropriate for PDF content.

PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.

Classes

PDFParser

PDFParser(content: Content)

Bases: BaseParser[T], Generic[T]

Base PDF parser.

Notes

Responsibilities:

- This class enforces PDF content-type compatibility and provides
  the extension point for implementing concrete PDF parsing strategies.

Constraints:

- Concrete implementations must define the output type `T` and
  implement the `parse()` method.
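
To illustrate the constraint above, here is a minimal sketch of a concrete subclass. OmniRead is not imported: `ContentType`, `Content`, `BaseParser`, and `PDFParser` are simplified stand-ins that mirror only what this page documents, and the `Content` field names (`raw`, `content_type`) are assumptions. `PDFVersionParser` is a hypothetical toy that reads only the `%PDF-x.y` header; a real implementation would interpret the whole document.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    # Stand-in for OmniRead's ContentType; only the members needed here.
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    # Stand-in for OmniRead's Content; field names are assumptions.
    raw: bytes
    content_type: ContentType


class BaseParser(ABC, Generic[T]):
    # Stand-in for the format-agnostic base class.
    supported_types: set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T: ...


class PDFParser(BaseParser[T], Generic[T]):
    # Stand-in for omniread.pdf.parser.PDFParser.
    supported_types: set[ContentType] = {ContentType.PDF}


class PDFVersionParser(PDFParser[str]):
    """Hypothetical concrete parser: T is str (the PDF header version)."""

    def parse(self) -> str:
        header = self.content.raw[:8]
        if not header.startswith(b"%PDF-"):
            raise ValueError("Missing %PDF- header")
        # Bytes 5..7 of "%PDF-1.7" hold the version string.
        return header[5:8].decode("ascii")


content = Content(raw=b"%PDF-1.7\n", content_type=ContentType.PDF)
version = PDFVersionParser(content).parse()
print(version)
```

Binding `T` in the class definition (`PDFParser[str]`) lets a type checker infer the return type of `parse()` for callers.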

Initialize the parser with content to be parsed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| content | Content | Content instance to be parsed. | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the content type is not supported by this parser. |

Attributes
supported_types class-attribute instance-attribute
supported_types: set[ContentType] = {PDF}

Set of content types supported by this parser (PDF only).
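
Because `supported_types` is a class attribute, a caller could check compatibility before constructing a parser. The sketch below assumes exactly that. `can_parse` is a hypothetical helper, not part of OmniRead, and the `ContentType`, `Content`, and `PDFParser` definitions are simplified stand-ins with assumed field names.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    # Stand-in for OmniRead's ContentType.
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    # Stand-in for OmniRead's Content; field names are assumptions.
    raw: bytes
    content_type: ContentType


class PDFParser:
    # Stand-in exposing only the documented class attribute.
    supported_types: set[ContentType] = {ContentType.PDF}


def can_parse(parser_cls: type, content: Content) -> bool:
    """Hypothetical helper: test compatibility without instantiating the parser."""
    return content.content_type in parser_cls.supported_types


pdf = Content(raw=b"%PDF-1.7", content_type=ContentType.PDF)
html = Content(raw=b"<html></html>", content_type=ContentType.HTML)

pdf_ok = can_parse(PDFParser, pdf)
html_ok = can_parse(PDFParser, html)
print(pdf_ok, html_ok)
```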

Functions
parse abstractmethod
parse() -> T

Parse PDF content into a structured output.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| T | T | Parsed representation of type T. |

Raises:

| Type | Description |
| --- | --- |
| Exception | Parsing-specific errors as defined by the implementation. |

Notes

Responsibilities:

- Implementations must fully interpret the PDF binary payload and
  return a deterministic, structured output.
supports
supports() -> bool

Check whether this parser supports the content's type.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the content type is supported; False otherwise. |
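
A minimal sketch of how `supports()` and the constructor's `ValueError` might fit together. It assumes `supports()` is a membership test against `supported_types` and that the constructor calls it. This page does not confirm either detail. All classes are simplified stand-ins, and `RawBytesParser` is a hypothetical subclass.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    # Stand-in for OmniRead's ContentType.
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    # Stand-in for OmniRead's Content; field names are assumptions.
    raw: bytes
    content_type: ContentType


class PDFParser(ABC):
    # Stand-in; assumes the constructor rejects content when supports() is False.
    supported_types: set[ContentType] = {ContentType.PDF}

    def __init__(self, content: Content) -> None:
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> bytes: ...


class RawBytesParser(PDFParser):
    """Hypothetical subclass that returns the payload unchanged."""

    def parse(self) -> bytes:
        return self.content.raw


parser = RawBytesParser(Content(b"%PDF-1.7", ContentType.PDF))
supported = parser.supports()

# Non-PDF content is rejected at construction time in this sketch.
rejected = False
try:
    RawBytesParser(Content(b"<html></html>", ContentType.HTML))
except ValueError:
    rejected = True

print(supported, rejected)
```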