{ "module": "omniread.pdf", "content": { "path": "omniread.pdf", "docstring": "# Summary\n\nPDF format implementation for OmniRead.\n\nThis package provides **PDF-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nUnlike HTML, PDF handling requires an explicit client layer for document\naccess. This package therefore includes:\n\n- PDF clients for acquiring raw PDF data.\n- PDF scrapers that coordinate client access.\n- PDF parsers that extract structured content from PDF binaries.\n\nPublic exports from this package represent the supported PDF pipeline\nand are safe for consumers to import directly when working with PDFs.\n\n---\n\n# Public API\n\n- `FileSystemPDFClient`\n- `PDFScraper`\n- `PDFParser`\n\n---", "objects": { "FileSystemPDFClient": { "name": "FileSystemPDFClient", "kind": "class", "path": "omniread.pdf.FileSystemPDFClient", "signature": "", "docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns\n their raw binary contents.", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.pdf.FileSystemPDFClient.fetch", "signature": "", "docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path (Path):\n Filesystem path to the PDF file.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError:\n If the path does not exist.\n ValueError:\n If the path exists but is not a file." 
} } }, "PDFScraper": { "name": "PDFScraper", "kind": "class", "path": "omniread.pdf.PDFScraper", "signature": "", "docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n - Preserves caller-provided metadata.\n\n **Constraints:**\n\n - The scraper does not perform parsing or interpretation.\n - Does not assume a specific storage backend.", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.pdf.PDFScraper.fetch", "signature": "", "docstring": "Fetch a PDF document from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF source as understood by the configured PDF client.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to attach to the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors raised by the PDF client." } } }, "PDFParser": { "name": "PDFParser", "kind": "class", "path": "omniread.pdf.PDFParser", "signature": "", "docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n\n **Constraints:**\n\n - Concrete implementations must define the output type `T` and\n implement the `parse()` method.", "members": { "supported_types": { "name": "supported_types", "kind": "attribute", "path": "omniread.pdf.PDFParser.supported_types", "signature": "", "docstring": "Set of content types supported by this parser (PDF only)." 
}, "parse": { "name": "parse", "kind": "function", "path": "omniread.pdf.PDFParser.parse", "signature": "", "docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output." } } }, "client": { "name": "client", "kind": "module", "path": "omniread.pdf.client", "signature": null, "docstring": "# Summary\n\nPDF client abstractions for OmniRead.\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems", "members": { "Any": { "name": "Any", "kind": "alias", "path": "omniread.pdf.client.Any", "signature": "", "docstring": null }, "ABC": { "name": "ABC", "kind": "alias", "path": "omniread.pdf.client.ABC", "signature": "", "docstring": null }, "abstractmethod": { "name": "abstractmethod", "kind": "alias", "path": "omniread.pdf.client.abstractmethod", "signature": "", "docstring": null }, "Path": { "name": "Path", "kind": "alias", "path": "omniread.pdf.client.Path", "signature": "", "docstring": null }, "BasePDFClient": { "name": "BasePDFClient", "kind": "class", "path": "omniread.pdf.client.BasePDFClient", "signature": "", "docstring": "Abstract client responsible for retrieving PDF bytes.\n\nRetrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to\n the backing store.\n - Return the full PDF binary payload.\n - Raise 
retrieval-specific errors on failure.", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.pdf.client.BasePDFClient.fetch", "signature": "", "docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF location, such as a file path, object storage key, or remote reference.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n Exception:\n Retrieval-specific errors defined by the implementation." } } }, "FileSystemPDFClient": { "name": "FileSystemPDFClient", "kind": "class", "path": "omniread.pdf.client.FileSystemPDFClient", "signature": "", "docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns\n their raw binary contents.", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.pdf.client.FileSystemPDFClient.fetch", "signature": "", "docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path (Path):\n Filesystem path to the PDF file.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError:\n If the path does not exist.\n ValueError:\n If the path exists but is not a file." 
} } } } }, "parser": { "name": "parser", "kind": "module", "path": "omniread.pdf.parser", "signature": null, "docstring": "# Summary\n\nPDF parser base implementations for OmniRead.\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.", "members": { "Generic": { "name": "Generic", "kind": "alias", "path": "omniread.pdf.parser.Generic", "signature": "", "docstring": null }, "TypeVar": { "name": "TypeVar", "kind": "alias", "path": "omniread.pdf.parser.TypeVar", "signature": "", "docstring": null }, "abstractmethod": { "name": "abstractmethod", "kind": "alias", "path": "omniread.pdf.parser.abstractmethod", "signature": "", "docstring": null }, "ContentType": { "name": "ContentType", "kind": "class", "path": "omniread.pdf.parser.ContentType", "signature": "", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.", "members": { "HTML": { "name": "HTML", "kind": "attribute", "path": "omniread.pdf.parser.ContentType.HTML", "signature": "", "docstring": "HTML document content." }, "PDF": { "name": "PDF", "kind": "attribute", "path": "omniread.pdf.parser.ContentType.PDF", "signature": "", "docstring": "PDF document content." }, "JSON": { "name": "JSON", "kind": "attribute", "path": "omniread.pdf.parser.ContentType.JSON", "signature": "", "docstring": "JSON document content." }, "XML": { "name": "XML", "kind": "attribute", "path": "omniread.pdf.parser.ContentType.XML", "signature": "", "docstring": "XML document content." 
} } }, "BaseParser": { "name": "BaseParser", "kind": "class", "path": "omniread.pdf.parser.BaseParser", "signature": "", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.", "members": { "supported_types": { "name": "supported_types", "kind": "attribute", "path": "omniread.pdf.parser.BaseParser.supported_types", "signature": "", "docstring": "Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic." }, "content": { "name": "content", "kind": "attribute", "path": "omniread.pdf.parser.BaseParser.content", "signature": "", "docstring": null }, "parse": { "name": "parse", "kind": "function", "path": "omniread.pdf.parser.BaseParser.parse", "signature": "", "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output." }, "supports": { "name": "supports", "kind": "function", "path": "omniread.pdf.parser.BaseParser.supports", "signature": "", "docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise." 
} } }, "T": { "name": "T", "kind": "attribute", "path": "omniread.pdf.parser.T", "signature": null, "docstring": null }, "PDFParser": { "name": "PDFParser", "kind": "class", "path": "omniread.pdf.parser.PDFParser", "signature": "", "docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n\n **Constraints:**\n\n - Concrete implementations must define the output type `T` and\n implement the `parse()` method.", "members": { "supported_types": { "name": "supported_types", "kind": "attribute", "path": "omniread.pdf.parser.PDFParser.supported_types", "signature": null, "docstring": "Set of content types supported by this parser (PDF only)." }, "parse": { "name": "parse", "kind": "function", "path": "omniread.pdf.parser.PDFParser.parse", "signature": "", "docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output." 
} } } } }, "scraper": { "name": "scraper", "kind": "module", "path": "omniread.pdf.scraper", "signature": null, "docstring": "# Summary\n\nPDF scraping implementation for OmniRead.\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.", "members": { "Any": { "name": "Any", "kind": "alias", "path": "omniread.pdf.scraper.Any", "signature": "", "docstring": null }, "Mapping": { "name": "Mapping", "kind": "alias", "path": "omniread.pdf.scraper.Mapping", "signature": "", "docstring": null }, "Optional": { "name": "Optional", "kind": "alias", "path": "omniread.pdf.scraper.Optional", "signature": "", "docstring": null }, "Content": { "name": "Content", "kind": "class", "path": "omniread.pdf.scraper.Content", "signature": "", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.", "members": { "raw": { "name": "raw", "kind": "attribute", "path": "omniread.pdf.scraper.Content.raw", "signature": "", "docstring": "Raw content bytes as retrieved from the source." }, "source": { "name": "source", "kind": "attribute", "path": "omniread.pdf.scraper.Content.source", "signature": "", "docstring": "Identifier of the content origin (URL, file path, or logical name)." }, "content_type": { "name": "content_type", "kind": "attribute", "path": "omniread.pdf.scraper.Content.content_type", "signature": "", "docstring": "Optional MIME type of the content, if known." 
}, "metadata": { "name": "metadata", "kind": "attribute", "path": "omniread.pdf.scraper.Content.metadata", "signature": "", "docstring": "Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes)." } } }, "ContentType": { "name": "ContentType", "kind": "class", "path": "omniread.pdf.scraper.ContentType", "signature": "", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.", "members": { "HTML": { "name": "HTML", "kind": "attribute", "path": "omniread.pdf.scraper.ContentType.HTML", "signature": "", "docstring": "HTML document content." }, "PDF": { "name": "PDF", "kind": "attribute", "path": "omniread.pdf.scraper.ContentType.PDF", "signature": "", "docstring": "PDF document content." }, "JSON": { "name": "JSON", "kind": "attribute", "path": "omniread.pdf.scraper.ContentType.JSON", "signature": "", "docstring": "JSON document content." }, "XML": { "name": "XML", "kind": "attribute", "path": "omniread.pdf.scraper.ContentType.XML", "signature": "", "docstring": "XML document content." } } }, "BaseScraper": { "name": "BaseScraper", "kind": "class", "path": "omniread.pdf.scraper.BaseScraper", "signature": "", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. 
It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, and retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.pdf.scraper.BaseScraper.fetch", "signature": "", "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object." 
} } }, "BasePDFClient": { "name": "BasePDFClient", "kind": "class", "path": "omniread.pdf.scraper.BasePDFClient", "signature": "", "docstring": "Abstract client responsible for retrieving PDF bytes.\n\nRetrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to\n the backing store.\n - Return the full PDF binary payload.\n - Raise retrieval-specific errors on failure.", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.pdf.scraper.BasePDFClient.fetch", "signature": "", "docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF location, such as a file path, object storage key, or remote reference.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n Exception:\n Retrieval-specific errors defined by the implementation." } } }, "PDFScraper": { "name": "PDFScraper", "kind": "class", "path": "omniread.pdf.scraper.PDFScraper", "signature": "", "docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n - Preserves caller-provided metadata.\n\n **Constraints:**\n\n - The scraper does not perform parsing or interpretation.\n - Does not assume a specific storage backend.", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.pdf.scraper.PDFScraper.fetch", "signature": "", "docstring": "Fetch a PDF document from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF source as understood by the configured PDF client.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to attach to the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors raised by the PDF client." } } } } } } } }