{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"omniread","text":""},{"location":"#omniread","title":"omniread","text":""},{"location":"#omniread--summary","title":"Summary","text":"
OmniRead \u2014 format-agnostic content acquisition and parsing framework.
OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.
The library is structured around three core concepts:
- Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
- Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
- Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure:
Install OmniRead using pip:
pip install omniread\n Install OmniRead using Poetry:
poetry add omniread\n"},{"location":"#omniread--quick-start","title":"Quick start","text":"HTML example:
from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -> str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n PDF example:
from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -> str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n"},{"location":"#omniread--public-api","title":"Public API","text":"This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.
- Content: Canonical content model.
- ContentType: Supported media types.
- HTMLScraper: HTTP-based HTML acquisition.
- HTMLParser: Base parser for HTML DOM interpretation.
- FileSystemPDFClient: Local filesystem PDF access.
- PDFScraper: PDF-specific content acquisition.
- PDFParser: Base parser for PDF binary interpretation.
OmniRead is designed as a decoupled content engine:
Content model, ensuring a consistent contract.dataclass","text":"Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n Normalized representation of extracted content.
Notes
Responsibilities:
- A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.\n"},{"location":"#omniread.Content-attributes","title":"Attributes","text":""},{"location":"#omniread.Content.content_type","title":"content_type class-attribute instance-attribute","text":"content_type: Optional[ContentType] = None\n Optional MIME type of the content, if known.
"},{"location":"#omniread.Content.metadata","title":"metadata class-attribute instance-attribute","text":"metadata: Optional[Mapping[str, Any]] = None\n Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
"},{"location":"#omniread.Content.raw","title":"raw instance-attribute","text":"raw: bytes\n Raw content bytes as retrieved from the source.
"},{"location":"#omniread.Content.source","title":"source instance-attribute","text":"source: str\n Identifier of the content origin (URL, file path, or logical name).
"},{"location":"#omniread.ContentType","title":"ContentType","text":" Bases: str, Enum
Supported MIME types for extracted content.
Notes
Guarantees:
- This enum represents the declared or inferred media type of the\n content source.\n- It is primarily used for routing content to the appropriate\n parser or downstream consumer.\n"},{"location":"#omniread.ContentType-attributes","title":"Attributes","text":""},{"location":"#omniread.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"HTML = 'text/html'\n HTML document content.
"},{"location":"#omniread.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"JSON = 'application/json'\n JSON document content.
"},{"location":"#omniread.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"PDF = 'application/pdf'\n PDF document content.
"},{"location":"#omniread.ContentType.XML","title":"XML class-attribute instance-attribute","text":"XML = 'application/xml'\n XML document content.
"},{"location":"#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":" Bases: BasePDFClient
PDF client that reads from the local filesystem.
Notes
Guarantees:
- This client reads PDF files directly from the disk and returns\n their raw binary contents.\n"},{"location":"#omniread.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"fetch(path: Path) -> bytes\n Read a PDF file from the local filesystem.
Parameters:
Name Type Description Default
path Path Filesystem path to the PDF file. required
Returns:
Name Type Description
bytes bytes Raw PDF bytes.
Raises:
Type Description
FileNotFoundError If the path does not exist.
ValueError If the path exists but is not a file.
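The documented behaviour of `fetch` can be approximated with the standard library alone. The sketch below is an assumption about how such a client might be written, not OmniRead's actual implementation; the function name `read_pdf_bytes` is hypothetical.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    # Hypothetical stand-in for FileSystemPDFClient.fetch; not the library's code.
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    if not path.is_file():
        raise ValueError(f'Path is not a file: {path}')
    # Return the raw bytes unchanged; interpretation is left to a parser.
    return path.read_bytes()
```

Checking existence before the file test keeps the two documented error cases distinct: a missing path raises FileNotFoundError, while a directory raises ValueError.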
"},{"location":"#omniread.HTMLParser","title":"HTMLParser","text":"HTMLParser(content: Content, features: str = 'html.parser')\n Bases: BaseParser[T], Generic[T]
Base HTML parser.
Notes
Responsibilities:
- This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n Guarantees:
- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n Constraints:
- Concrete subclasses must define the output type `T` and implement\n the `parse()` method.\n Initialize the HTML parser.
Parameters:
Name Type Description Default
content Content HTML content to be parsed. required
features str BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). 'html.parser'
Raises:
Type Description
ValueError If the content is empty or not valid HTML.
"},{"location":"#omniread.HTMLParser-attributes","title":"Attributes","text":""},{"location":"#omniread.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"supported_types: set[ContentType] = {HTML}\n Set of content types supported by this parser (HTML only).
"},{"location":"#omniread.HTMLParser-functions","title":"Functions","text":""},{"location":"#omniread.HTMLParser.parse","title":"parse abstractmethod","text":"parse() -> T\n Fully parse the HTML content into structured output.
Returns:
Name Type Description
T T Parsed representation of type T.
Responsibilities:
- Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output.\n"},{"location":"#omniread.HTMLParser.parse_div","title":"parse_div staticmethod","text":"parse_div(div: Tag, *, separator: str = ' ') -> str\n Extract normalized text from a <div> element.
Parameters:
Name Type Description Default
div Tag BeautifulSoup tag representing a <div>.
separator str String used to separate text nodes. ' '
Returns:
Name Type Description
str str Flattened, whitespace-normalized text content.
"},{"location":"#omniread.HTMLParser.parse_link","title":"parse_link staticmethod","text":"parse_link(a: Tag) -> Optional[str]\n Extract the hyperlink reference from an <a> element.
Parameters:
Name Type Description Default
a Tag BeautifulSoup tag representing an anchor. required
Returns:
Type Description
Optional[str] The value of the href attribute, or None if absent.
parse_meta() -> dict[str, Any]\n Extract high-level metadata from the HTML document.
Returns:
Type Description
dict[str, Any] Dictionary containing extracted metadata.
Notes
Responsibilities:
- Extract high-level metadata from the HTML document.\n- This includes: Document title, `<meta>` tag name/property to\n content mappings.\n"},{"location":"#omniread.HTMLParser.parse_table","title":"parse_table staticmethod","text":"parse_table(table: Tag) -> list[list[str]]\n Parse an HTML table into a 2D list of strings.
Parameters:
Name Type Description Default
table Tag BeautifulSoup tag representing a <table>.
Returns:
Type Description
list[list[str]] A list of rows, where each row is a list of cell text values.
"},{"location":"#omniread.HTMLParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
Name Type Description
bool bool True if the content type is supported; False otherwise.
"},{"location":"#omniread.HTMLScraper","title":"HTMLScraper","text":"HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n Bases: BaseScraper
Base HTML scraper using httpx.
Responsibilities:
- This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n Constraints:
- The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.\n Initialize the HTML scraper.
Parameters:
Name Type Description Default
client Client | None Optional pre-configured httpx.Client. If omitted, a client is created internally. None
timeout float Request timeout in seconds. 15.0
headers Optional[Mapping[str, str]] Optional default HTTP headers. None
follow_redirects bool Whether to follow HTTP redirects. True
"},{"location":"#omniread.HTMLScraper-functions","title":"Functions","text":""},{"location":"#omniread.HTMLScraper.fetch","title":"fetch","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch an HTML document from the given source.
Parameters:
Name Type Description Default
source str URL of the HTML document. required
metadata Optional[Mapping[str, Any]] Optional metadata to be merged into the returned content. None
Returns:
Name Type Description
Content Content A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.
Raises:
Type Description
HTTPError If the HTTP request fails.
ValueError If the response is not valid HTML.
"},{"location":"#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"validate_content_type(response: httpx.Response) -> None\n Validate that the HTTP response contains HTML content.
Parameters:
Name Type Description Default
response Response HTTP response returned by httpx.
Raises:
Type Description
ValueError If the Content-Type header is missing or does not indicate HTML content.
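The validation rule described here can be illustrated on a plain header mapping. This is a minimal sketch under stated assumptions: the helper name `validate_html_content_type` is hypothetical, it takes a mapping rather than an `httpx.Response`, and it treats only `text/html` as HTML, which may be stricter or looser than what OmniRead actually accepts.

```python
from typing import Mapping


def validate_html_content_type(headers: Mapping[str, str]) -> None:
    # Hypothetical helper: works on a plain header mapping, not httpx.Response.
    # Header names are case-insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    content_type = normalised.get('content-type')
    if content_type is None:
        raise ValueError('Missing Content-Type header')
    # Drop parameters such as '; charset=utf-8' before comparing.
    media_type = content_type.split(';', 1)[0].strip().lower()
    if media_type != 'text/html':
        raise ValueError(f'Expected text/html, got {media_type}')
```

Stripping the parameters first means a header such as `text/html; charset=utf-8` passes, while `application/json` or a missing header raises ValueError.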
PDFParser(content: Content)\n Bases: BaseParser[T], Generic[T]
Base PDF parser.
Notes
Responsibilities:
- This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n Constraints:
- Concrete implementations must define the output type `T` and\n implement the `parse()` method.\n Initialize the parser with content to be parsed.
Parameters:
Name Type Description Default
content Content Content instance to be parsed. required
Raises:
Type Description
ValueError If the content type is not supported by this parser.
"},{"location":"#omniread.PDFParser-attributes","title":"Attributes","text":""},{"location":"#omniread.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"supported_types: set[ContentType] = {PDF}\n Set of content types supported by this parser (PDF only).
"},{"location":"#omniread.PDFParser-functions","title":"Functions","text":""},{"location":"#omniread.PDFParser.parse","title":"parse abstractmethod","text":"parse() -> T\n Parse PDF content into a structured output.
Returns:
Name Type Description
T T Parsed representation of type T.
Raises:
Type Description
Exception Parsing-specific errors as defined by the implementation.
Notes
Responsibilities:
- Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output.\n"},{"location":"#omniread.PDFParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
Name Type Description
bool bool True if the content type is supported; False otherwise.
"},{"location":"#omniread.PDFScraper","title":"PDFScraper","text":"PDFScraper(*, client: BasePDFClient)\n Bases: BaseScraper
Scraper for PDF sources.
Notes
Responsibilities:
- Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n- Preserves caller-provided metadata.\n Constraints:
- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n Initialize the PDF scraper.
Parameters:
Name Type Description Default
client BasePDFClient PDF client responsible for retrieving raw PDF bytes. required
"},{"location":"#omniread.PDFScraper-functions","title":"Functions","text":""},{"location":"#omniread.PDFScraper.fetch","title":"fetch","text":"fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch a PDF document from the given source.
Parameters:
Name Type Description Default
source Any Identifier of the PDF source as understood by the configured PDF client. required
metadata Optional[Mapping[str, Any]] Optional metadata to attach to the returned content. None
Returns:
Name Type Description
Content Content A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.
Raises:
Type Description
Exception Retrieval-specific errors raised by the PDF client.
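The delegation pattern described for the PDF scraper might look like the following. Everything here is a self-contained stand-in written for illustration: `InMemoryPDFClient` and `SketchPDFScraper` are hypothetical names, and the local `Content` and `ContentType` definitions only mirror the documented signatures rather than importing OmniRead.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    PDF = 'application/pdf'


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


class InMemoryPDFClient:
    # Hypothetical client that serves bytes from a dict instead of a disk.
    def __init__(self, files: Mapping[str, bytes]) -> None:
        self._files = files

    def fetch(self, source: str) -> bytes:
        return self._files[source]


class SketchPDFScraper:
    # Stand-in for PDFScraper: delegates retrieval and never parses.
    def __init__(self, *, client: InMemoryPDFClient) -> None:
        self._client = client

    def fetch(self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        raw = self._client.fetch(source)
        return Content(
            raw=raw,
            source=str(source),
            content_type=ContentType.PDF,
            metadata=dict(metadata) if metadata is not None else None,
        )


scraper = SketchPDFScraper(client=InMemoryPDFClient({'report.pdf': b'%PDF-1.7'}))
content = scraper.fetch('report.pdf', metadata={'origin': 'memory'})
```

Because the scraper only knows the client's `fetch` method, swapping the in-memory client for one backed by a filesystem or object store would not change the scraper itself.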
"},{"location":"core/","title":"Core","text":""},{"location":"core/#omniread.core","title":"omniread.core","text":""},{"location":"core/#omniread.core--summary","title":"Summary","text":"Core domain contracts for OmniRead.
This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).
Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.
Submodules:
- content: Canonical content models and enums.
- parser: Abstract parsing contracts.
- scraper: Abstract scraping contracts.
Format-specific behavior must not be introduced at this layer.
"},{"location":"core/#omniread.core--public-api","title":"Public API","text":"Content
ContentType
BaseParser(content: Content)\n Bases: ABC, Generic[T]
Base interface for all parsers.
Notes
Guarantees:
- A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n- Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n Responsibilities:
- Implementations must declare supported content types via `supported_types`.\n- Implementations must raise parsing-specific exceptions from `parse()`.\n- Implementations must remain deterministic for a given input.\n Initialize the parser with content to be parsed.
Parameters:
Name Type Description Default
content Content Content instance to be parsed. required
Raises:
Type Description
ValueError If the content type is not supported by this parser.
"},{"location":"core/#omniread.core.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"supported_types: Set[ContentType] = set()\n Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.
"},{"location":"core/#omniread.core.BaseParser-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseParser.parse","title":"parse abstractmethod","text":"parse() -> T\n Parse the owned content into structured output.
Returns:
Name Type Description
T T Parsed, structured representation.
Raises:
Type Description
Exception Parsing-specific errors as defined by the implementation.
Notes
Responsibilities:
- Implementations must fully consume the provided content and\n return a deterministic, structured output.\n"},{"location":"core/#omniread.core.BaseParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
Name Type Description
bool bool True if the content type is supported; False otherwise.
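The interplay between `supported_types`, `supports()` and early validation can be sketched as below. This is an illustrative reconstruction from the descriptions above, not OmniRead's source: `SketchParser` and `LengthParser` are hypothetical names, and the local `Content` and `ContentType` are simplified stand-ins.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, Optional, Set, TypeVar

T = TypeVar('T')


class ContentType(str, Enum):
    HTML = 'text/html'
    PDF = 'application/pdf'


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None


class SketchParser(ABC, Generic[T]):
    # An empty set means the parser accepts any content type.
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        self.content = content
        # Early validation: fail at construction, not at parse time.
        if not self.supports():
            raise ValueError(f'Unsupported content type: {content.content_type}')

    def supports(self) -> bool:
        if not self.supported_types:
            return True
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class LengthParser(SketchParser[int]):
    # Hypothetical concrete parser that only accepts PDF content.
    supported_types = {ContentType.PDF}

    def parse(self) -> int:
        return len(self.content.raw)
```

Constructing `LengthParser` with PDF content succeeds, while passing HTML content raises ValueError before `parse()` is ever called.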
"},{"location":"core/#omniread.core.BaseScraper","title":"BaseScraper","text":" Bases: ABC
Base interface for all scrapers.
Notes
Responsibilities:
- A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n- A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n- Scrapers define how content is obtained, not what the content means.\n- Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n Constraints:
- Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.\n"},{"location":"core/#omniread.core.BaseScraper-functions","title":"Functions","text":""},{"location":"core/#omniread.core.BaseScraper.fetch","title":"fetch abstractmethod","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch raw content from the given source.
Parameters:
Name Type Description Default
source str Location identifier (URL, file path, S3 URI, etc.). required
metadata Optional[Mapping[str, Any]] Optional hints for the scraper (headers, auth, etc.). None
Returns:
Name Type Description
Content Content Content object containing raw bytes and metadata.
Raises:
Type Description
Exception Retrieval-specific errors as defined by the implementation.
Notes
Responsibilities:
- Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object.\n"},{"location":"core/#omniread.core.Content","title":"Content dataclass","text":"Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n Normalized representation of extracted content.
Notes
Responsibilities:
- A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.\n"},{"location":"core/#omniread.core.Content-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.Content.content_type","title":"content_type class-attribute instance-attribute","text":"content_type: Optional[ContentType] = None\n Optional MIME type of the content, if known.
"},{"location":"core/#omniread.core.Content.metadata","title":"metadata class-attribute instance-attribute","text":"metadata: Optional[Mapping[str, Any]] = None\n Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
"},{"location":"core/#omniread.core.Content.raw","title":"raw instance-attribute","text":"raw: bytes\n Raw content bytes as retrieved from the source.
"},{"location":"core/#omniread.core.Content.source","title":"source instance-attribute","text":"source: str\n Identifier of the content origin (URL, file path, or logical name).
"},{"location":"core/#omniread.core.ContentType","title":"ContentType","text":" Bases: str, Enum
Supported MIME types for extracted content.
Notes
Guarantees:
- This enum represents the declared or inferred media type of the\n content source.\n- It is primarily used for routing content to the appropriate\n parser or downstream consumer.\n"},{"location":"core/#omniread.core.ContentType-attributes","title":"Attributes","text":""},{"location":"core/#omniread.core.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"HTML = 'text/html'\n HTML document content.
"},{"location":"core/#omniread.core.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"JSON = 'application/json'\n JSON document content.
"},{"location":"core/#omniread.core.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"PDF = 'application/pdf'\n PDF document content.
"},{"location":"core/#omniread.core.ContentType.XML","title":"XML class-attribute instance-attribute","text":"XML = 'application/xml'\n XML document content.
"},{"location":"core/content/","title":"Content","text":""},{"location":"core/content/#omniread.core.content","title":"omniread.core.content","text":""},{"location":"core/content/#omniread.core.content--summary","title":"Summary","text":"Canonical content models for OmniRead.
This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.
The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.
"},{"location":"core/content/#omniread.core.content-classes","title":"Classes","text":""},{"location":"core/content/#omniread.core.content.Content","title":"Content dataclass","text":"Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n Normalized representation of extracted content.
Notes
Responsibilities:
- A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n- This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.\n"},{"location":"core/content/#omniread.core.content.Content-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.Content.content_type","title":"content_type class-attribute instance-attribute","text":"content_type: Optional[ContentType] = None\n Optional MIME type of the content, if known.
"},{"location":"core/content/#omniread.core.content.Content.metadata","title":"metadata class-attribute instance-attribute","text":"metadata: Optional[Mapping[str, Any]] = None\n Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
"},{"location":"core/content/#omniread.core.content.Content.raw","title":"raw instance-attribute","text":"raw: bytes\n Raw content bytes as retrieved from the source.
"},{"location":"core/content/#omniread.core.content.Content.source","title":"source instance-attribute","text":"source: str\n Identifier of the content origin (URL, file path, or logical name).
"},{"location":"core/content/#omniread.core.content.ContentType","title":"ContentType","text":" Bases: str, Enum
Supported MIME types for extracted content.
Notes
Guarantees:
- This enum represents the declared or inferred media type of the\n content source.\n- It is primarily used for routing content to the appropriate\n parser or downstream consumer.\n"},{"location":"core/content/#omniread.core.content.ContentType-attributes","title":"Attributes","text":""},{"location":"core/content/#omniread.core.content.ContentType.HTML","title":"HTML class-attribute instance-attribute","text":"HTML = 'text/html'\n HTML document content.
"},{"location":"core/content/#omniread.core.content.ContentType.JSON","title":"JSON class-attribute instance-attribute","text":"JSON = 'application/json'\n JSON document content.
"},{"location":"core/content/#omniread.core.content.ContentType.PDF","title":"PDF class-attribute instance-attribute","text":"PDF = 'application/pdf'\n PDF document content.
"},{"location":"core/content/#omniread.core.content.ContentType.XML","title":"XML class-attribute instance-attribute","text":"XML = 'application/xml'\n XML document content.
"},{"location":"core/parser/","title":"Parser","text":""},{"location":"core/parser/#omniread.core.parser","title":"omniread.core.parser","text":""},{"location":"core/parser/#omniread.core.parser--summary","title":"Summary","text":"Abstract parsing contracts for OmniRead.
This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.
Parsers are responsible for:
Content instance
Parsers are not responsible for:
BaseParser(content: Content)\n Bases: ABC, Generic[T]
Base interface for all parsers.
Notes
Guarantees:
- A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n- Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n Responsibilities:
- Implementations must declare supported content types via `supported_types`.\n- Implementations must raise parsing-specific exceptions from `parse()`.\n- Implementations must remain deterministic for a given input.\n Initialize the parser with content to be parsed.
Parameters:
Name Type Description Default
content Content Content instance to be parsed. required
Raises:
Type Description
ValueError If the content type is not supported by this parser.
"},{"location":"core/parser/#omniread.core.parser.BaseParser-attributes","title":"Attributes","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"supported_types: Set[ContentType] = set()\n Set of content types supported by this parser. An empty set indicates that the parser is content-type agnostic.
"},{"location":"core/parser/#omniread.core.parser.BaseParser-functions","title":"Functions","text":""},{"location":"core/parser/#omniread.core.parser.BaseParser.parse","title":"parse abstractmethod","text":"parse() -> T\n Parse the owned content into structured output.
Returns:
Name Type Description
T T Parsed, structured representation.
Raises:
Type Description
Exception Parsing-specific errors as defined by the implementation.
Notes
Responsibilities:
- Implementations must fully consume the provided content and\n return a deterministic, structured output.\n"},{"location":"core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
Name Type Description
bool bool True if the content type is supported; False otherwise.
"},{"location":"core/scraper/","title":"Scraper","text":""},{"location":"core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":""},{"location":"core/scraper/#omniread.core.scraper--summary","title":"Summary","text":"Abstract scraping contracts for OmniRead.
This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.
Scrapers are responsible for:
Content objects
Scrapers are explicitly NOT responsible for:
All interpretation must be delegated to parsers.
"},{"location":"core/scraper/#omniread.core.scraper-classes","title":"Classes","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":" Bases: ABC
Base interface for all scrapers.
Notes
Responsibilities:
- A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n- A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n- Scrapers define how content is obtained, not what the content means.\n- Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n Constraints:
- Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.\n"},{"location":"core/scraper/#omniread.core.scraper.BaseScraper-functions","title":"Functions","text":""},{"location":"core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetch abstractmethod","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch raw content from the given source.
Parameters:
Name Type Description Default
source str Location identifier (URL, file path, S3 URI, etc.). required
metadata Optional[Mapping[str, Any]] Optional hints for the scraper (headers, auth, etc.). None
Returns:
Name Type Description
Content Content Content object containing raw bytes and metadata.
Raises:
Type Description
Exception Retrieval-specific errors as defined by the implementation.
Notes
Responsibilities:
- Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object.\n"},{"location":"html/","title":"Html","text":""},{"location":"html/#omniread.html","title":"omniread.html","text":""},{"location":"html/#omniread.html--summary","title":"Summary","text":"HTML format implementation for OmniRead.
This package provides HTML-specific implementations of the core OmniRead contracts defined in omniread.core.
It includes:
Key characteristics:
omniread.core.content.
Consumers should depend on omniread.core interfaces wherever possible and use this package only when HTML-specific behavior is required.
HTMLScraper
HTMLParser
HTMLParser(content: Content, features: str = 'html.parser')\n Bases: BaseParser[T], Generic[T]
Base HTML parser.
Notes
Responsibilities:
- This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n Guarantees:
- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n Constraints:
- Concrete subclasses must define the output type `T` and implement\n the `parse()` method.\n Initialize the HTML parser.
Parameters:
Name Type Description Default
content Content HTML content to be parsed. required
features str BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). 'html.parser'
Raises:
Type Description
ValueError If the content is empty or not valid HTML.
"},{"location":"html/#omniread.html.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/#omniread.html.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"supported_types: set[ContentType] = {HTML}\n Set of content types supported by this parser (HTML only).
"},{"location":"html/#omniread.html.HTMLParser-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLParser.parse","title":"parse abstractmethod","text":"parse() -> T\n Fully parse the HTML content into structured output.
Returns:
Name Type Description
T T Parsed representation of type T.
Responsibilities:
- Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output.\n"},{"location":"html/#omniread.html.HTMLParser.parse_div","title":"parse_div staticmethod","text":"parse_div(div: Tag, *, separator: str = ' ') -> str\n Extract normalized text from a <div> element.
Parameters:
Name Type Description Default
div Tag BeautifulSoup tag representing a <div>.
separator str String used to separate text nodes. ' '
Returns:
Name Type Description
str str Flattened, whitespace-normalized text content.
"},{"location":"html/#omniread.html.HTMLParser.parse_link","title":"parse_link staticmethod","text":"parse_link(a: Tag) -> Optional[str]\n Extract the hyperlink reference from an <a> element.
Parameters:
Name Type Description Default
a Tag BeautifulSoup tag representing an anchor. required
Returns:
Type Description
Optional[str] The value of the href attribute, or None if absent.
parse_meta() -> dict[str, Any]\n Extract high-level metadata from the HTML document.
Returns:
dict[str, Any]: Dictionary containing extracted metadata.
Notes:
Responsibilities:
- Extract high-level metadata from the HTML document.\n- This includes: Document title, `<meta>` tag name/property to\n content mappings.\n"},{"location":"html/#omniread.html.HTMLParser.parse_table","title":"parse_table staticmethod","text":"parse_table(table: Tag) -> list[list[str]]\n Parse an HTML table into a 2D list of strings.
Parameters:
table (Tag): BeautifulSoup tag representing a <table>.
Returns:
list[list[str]]: A list of rows, where each row is a list of cell text values.
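The output shape described above (one inner list per row, one string per cell) can be illustrated with a minimal standard-library sketch. This is not OmniRead's implementation, which works on a BeautifulSoup Tag; the names TableRows and table_to_rows are hypothetical, and the whitespace collapsing is an assumption about what cell text normalization might look like.

```python
from __future__ import annotations

from html.parser import HTMLParser as StdlibHTMLParser


class TableRows(StdlibHTMLParser):
    '''Collect the text of each <td>/<th> cell, grouped by <tr> row.'''

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell = None  # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th') and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self.rows[-1].append(' '.join(''.join(self._cell).split()))
            self._cell = None


def table_to_rows(html: str) -> list[list[str]]:
    '''Return the table as a list of rows, each a list of cell strings.'''
    collector = TableRows()
    collector.feed(html)
    collector.close()
    return collector.rows
```

Header cells and data cells are treated alike in this sketch, so a header row simply becomes the first inner list.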
"},{"location":"html/#omniread.html.HTMLParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
"},{"location":"html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n Bases: BaseScraper
Base HTML scraper using httpx.
Responsibilities:
- This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n Constraints:
- The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.\n Initialize the HTML scraper.
Parameters:
client (Client | None): Optional pre-configured httpx.Client. If omitted, a client is created internally. Default: None.
timeout (float): Request timeout in seconds. Default: 15.0.
headers (Optional[Mapping[str, str]]): Optional default HTTP headers. Default: None.
follow_redirects (bool): Whether to follow HTTP redirects. Default: True.
"},{"location":"html/#omniread.html.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch an HTML document from the given source.
Parameters:
source (str): URL of the HTML document. Required.
metadata (Optional[Mapping[str, Any]]): Optional metadata to be merged into the returned content. Default: None.
Returns:
Content: A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.
Raises:
HTTPError: If the HTTP request fails.
ValueError: If the response is not valid HTML.
"},{"location":"html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"validate_content_type(response: httpx.Response) -> None\n Validate that the HTTP response contains HTML content.
Parameters:
response (Response): HTTP response returned by httpx.
Raises:
ValueError: If the Content-Type header is missing or does not indicate HTML content.
HTML parser base implementations for OmniRead.
This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.
It supplies:
Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.
HTMLParser(content: Content, features: str = 'html.parser')\n Bases: BaseParser[T], Generic[T]
Base HTML parser.
Notes:
Responsibilities:
- This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n- Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n Guarantees:
- Accepts only HTML content.\n- Owns a parsed BeautifulSoup DOM tree.\n- Provides pure helper utilities for common HTML structures.\n Constraints:
- Concrete subclasses must define the output type `T` and implement\n the `parse()` method.\n Initialize the HTML parser.
Parameters:
content (Content): HTML content to be parsed. Required.
features (str): BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). Default: 'html.parser'.
Raises:
ValueError: If the content is empty or not valid HTML.
"},{"location":"html/parser/#omniread.html.parser.HTMLParser-attributes","title":"Attributes","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"supported_types: set[ContentType] = {HTML}\n Set of content types supported by this parser (HTML only).
"},{"location":"html/parser/#omniread.html.parser.HTMLParser-functions","title":"Functions","text":""},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse","title":"parse abstractmethod","text":"parse() -> T\n Fully parse the HTML content into structured output.
Returns:
T: Parsed representation of type T.
Responsibilities:
- Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output.\n"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_div","title":"parse_div staticmethod","text":"parse_div(div: Tag, *, separator: str = ' ') -> str\n Extract normalized text from a <div> element.
Parameters:
div (Tag): BeautifulSoup tag representing a <div>.
separator (str): String used to separate text nodes. Default: ' '.
Returns:
str: Flattened, whitespace-normalized text content.
"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_link staticmethod","text":"parse_link(a: Tag) -> Optional[str]\n Extract the hyperlink reference from an <a> element.
Parameters:
a (Tag): BeautifulSoup tag representing an anchor. Required.
Returns:
Optional[str]: The value of the href attribute, or None if absent.
parse_meta() -> dict[str, Any]\n Extract high-level metadata from the HTML document.
Returns:
dict[str, Any]: Dictionary containing extracted metadata.
Notes:
Responsibilities:
- Extract high-level metadata from the HTML document.\n- This includes: Document title, `<meta>` tag name/property to\n content mappings.\n"},{"location":"html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_table staticmethod","text":"parse_table(table: Tag) -> list[list[str]]\n Parse an HTML table into a 2D list of strings.
Parameters:
table (Tag): BeautifulSoup tag representing a <table>.
Returns:
list[list[str]]: A list of rows, where each row is a list of cell text values.
"},{"location":"html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
"},{"location":"html/scraper/","title":"Scraper","text":""},{"location":"html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":""},{"location":"html/scraper/#omniread.html.scraper--summary","title":"Summary","text":"HTML scraping implementation for OmniRead.
This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.
This scraper is responsible for:
This scraper is not responsible for:
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n Bases: BaseScraper
Base HTML scraper using httpx.
Responsibilities:
- This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n- Fetches raw bytes and metadata only.\n- The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n Constraints:
- The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.\n Initialize the HTML scraper.
Parameters:
client (Client | None): Optional pre-configured httpx.Client. If omitted, a client is created internally. Default: None.
timeout (float): Request timeout in seconds. Default: 15.0.
headers (Optional[Mapping[str, str]]): Optional default HTTP headers. Default: None.
follow_redirects (bool): Whether to follow HTTP redirects. Default: True.
"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper-functions","title":"Functions","text":""},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch an HTML document from the given source.
Parameters:
source (str): URL of the HTML document. Required.
metadata (Optional[Mapping[str, Any]]): Optional metadata to be merged into the returned content. Default: None.
Returns:
Content: A Content instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.
Raises:
HTTPError: If the HTTP request fails.
ValueError: If the response is not valid HTML.
"},{"location":"html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"validate_content_type(response: httpx.Response) -> None\n Validate that the HTTP response contains HTML content.
Parameters:
response (Response): HTTP response returned by httpx.
Raises:
ValueError: If the Content-Type header is missing or does not indicate HTML content.
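The content-type check described above can be sketched on a plain header mapping, without httpx. This is an illustration only: the helper names looks_like_html and validate_headers are hypothetical, and the set of media types accepted as HTML is an assumption, not something the OmniRead documentation specifies.

```python
from typing import Mapping

# Assumed set of media types treated as HTML in this sketch.
HTML_MEDIA_TYPES = {'text/html', 'application/xhtml+xml'}


def looks_like_html(headers: Mapping[str, str]) -> bool:
    '''Return True if the Content-Type header names an HTML media type.'''
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get('content-type', '')
    # Drop parameters such as '; charset=utf-8' and compare the media type.
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


def validate_headers(headers: Mapping[str, str]) -> None:
    '''Raise ValueError when the header is missing or not HTML.'''
    if not looks_like_html(headers):
        raise ValueError('response does not indicate HTML content')
```

A missing header and a non-HTML header take the same failure path here, which mirrors the single ValueError documented for validate_content_type.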
PDF format implementation for OmniRead.
This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.
Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:
Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.
"},{"location":"pdf/#omniread.pdf--public-api","title":"Public API","text":"FileSystemPDFClient, PDFScraper, PDFParser
Bases: BasePDFClient
PDF client that reads from the local filesystem.
Notes:
Guarantees:
- This client reads PDF files directly from the disk and returns\n their raw binary contents.\n"},{"location":"pdf/#omniread.pdf.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"fetch(path: Path) -> bytes\n Read a PDF file from the local filesystem.
Parameters:
path (Path): Filesystem path to the PDF file. Required.
Returns:
bytes: Raw PDF bytes.
Raises:
FileNotFoundError: If the path does not exist.
ValueError: If the path exists but is not a file.
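The documented behaviour (return raw bytes, raise FileNotFoundError for a missing path, raise ValueError for a path that is not a file) can be sketched with pathlib alone. The function name read_pdf_bytes is hypothetical and the error messages are placeholders; the actual client may differ in detail.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    '''Read a file from disk and return its raw bytes.'''
    if not path.exists():
        raise FileNotFoundError(f'PDF file not found: {path}')
    if not path.is_file():
        # For example, the path points at a directory.
        raise ValueError(f'Path is not a file: {path}')
    return path.read_bytes()
```

The sketch does not inspect the bytes, in line with the statement that clients perform no validation or interpretation.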
"},{"location":"pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"PDFParser(content: Content)\n Bases: BaseParser[T], Generic[T]
Base PDF parser.
Notes:
Responsibilities:
- This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n Constraints:
- Concrete implementations must define the output type `T` and\n implement the `parse()` method.\n Initialize the parser with content to be parsed.
Parameters:
content (Content): Content instance to be parsed. Required.
Raises:
ValueError: If the content type is not supported by this parser.
"},{"location":"pdf/#omniread.pdf.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"supported_types: set[ContentType] = {PDF}\n Set of content types supported by this parser (PDF only).
"},{"location":"pdf/#omniread.pdf.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFParser.parse","title":"parse abstractmethod","text":"parse() -> T\n Parse PDF content into a structured output.
Returns:
T: Parsed representation of type T.
Raises:
Exception: Parsing-specific errors as defined by the implementation.
Notes:
Responsibilities:
- Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output.\n"},{"location":"pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
"},{"location":"pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"PDFScraper(*, client: BasePDFClient)\n Bases: BaseScraper
Scraper for PDF sources.
Notes:
Responsibilities:
- Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n- Preserves caller-provided metadata.\n Constraints:
- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n Initialize the PDF scraper.
Parameters:
client (BasePDFClient): PDF client responsible for retrieving raw PDF bytes. Required.
"},{"location":"pdf/#omniread.pdf.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch a PDF document from the given source.
Parameters:
source (Any): Identifier of the PDF source as understood by the configured PDF client. Required.
metadata (Optional[Mapping[str, Any]]): Optional metadata to attach to the returned content. Default: None.
Returns:
Content: A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.
Raises:
Exception: Retrieval-specific errors raised by the PDF client.
"},{"location":"pdf/client/","title":"Client","text":""},{"location":"pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":""},{"location":"pdf/client/#omniread.pdf.client--summary","title":"Summary","text":"PDF client abstractions for OmniRead.
This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.
Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.
Typical backing stores include:
Bases: ABC
Abstract client responsible for retrieving PDF bytes.
Retrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).
Notes:
Responsibilities:
- Implementations must accept a source identifier appropriate to\n the backing store.\n- Return the full PDF binary payload.\n- Raise retrieval-specific errors on failure.\n"},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetch abstractmethod","text":"fetch(source: Any) -> bytes\n Fetch raw PDF bytes from the given source.
Parameters:
source (Any): Identifier of the PDF location, such as a file path, object storage key, or remote reference. Required.
Returns:
bytes: Raw PDF bytes.
Raises:
Exception: Retrieval-specific errors defined by the implementation.
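As an illustration of the contract above, a custom backing store might look like the sketch below. To keep it self-contained, PDFClientSketch is a local stand-in for BasePDFClient with the same documented fetch(source) -> bytes shape; InMemoryPDFClient and its use of LookupError are hypothetical and not part of OmniRead.

```python
from abc import ABC, abstractmethod
from typing import Any, Mapping


class PDFClientSketch(ABC):
    '''Stand-in for BasePDFClient: one abstract fetch(source) -> bytes.'''

    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        ...


class InMemoryPDFClient(PDFClientSketch):
    '''Hypothetical client that serves PDFs from a dict keyed by name.'''

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        # Copy so later changes to the caller's mapping are not visible.
        self._documents = dict(documents)

    def fetch(self, source: Any) -> bytes:
        try:
            return self._documents[source]
        except KeyError:
            # Retrieval-specific error chosen for this sketch.
            raise LookupError(f'no PDF stored under {source!r}') from None
```

With the real library, such a client would presumably subclass BasePDFClient and be passed to PDFScraper through its client parameter, which could be handy in tests that should not touch the filesystem.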
"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":" Bases: BasePDFClient
PDF client that reads from the local filesystem.
Notes:
Guarantees:
- This client reads PDF files directly from the disk and returns\n their raw binary contents.\n"},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient-functions","title":"Functions","text":""},{"location":"pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"fetch(path: Path) -> bytes\n Read a PDF file from the local filesystem.
Parameters:
path (Path): Filesystem path to the PDF file. Required.
Returns:
bytes: Raw PDF bytes.
Raises:
FileNotFoundError: If the path does not exist.
ValueError: If the path exists but is not a file.
"},{"location":"pdf/parser/","title":"Parser","text":""},{"location":"pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":""},{"location":"pdf/parser/#omniread.pdf.parser--summary","title":"Summary","text":"PDF parser base implementations for OmniRead.
This module defines the PDF-specific parser contract, extending the format-agnostic BaseParser with constraints appropriate for PDF content.
PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.
"},{"location":"pdf/parser/#omniread.pdf.parser-classes","title":"Classes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"PDFParser(content: Content)\n Bases: BaseParser[T], Generic[T]
Base PDF parser.
Notes:
Responsibilities:
- This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n Constraints:
- Concrete implementations must define the output type `T` and\n implement the `parse()` method.\n Initialize the parser with content to be parsed.
Parameters:
content (Content): Content instance to be parsed. Required.
Raises:
ValueError: If the content type is not supported by this parser.
"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-attributes","title":"Attributes","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_types class-attribute instance-attribute","text":"supported_types: set[ContentType] = {PDF}\n Set of content types supported by this parser (PDF only).
"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser-functions","title":"Functions","text":""},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parse abstractmethod","text":"parse() -> T\n Parse PDF content into a structured output.
Returns:
T: Parsed representation of type T.
Raises:
Exception: Parsing-specific errors as defined by the implementation.
Notes:
Responsibilities:
- Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output.\n"},{"location":"pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
"},{"location":"pdf/scraper/","title":"Scraper","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper--summary","title":"Summary","text":"PDF scraping implementation for OmniRead.
This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.
The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.
PDFScraper(*, client: BasePDFClient)\n Bases: BaseScraper
Scraper for PDF sources.
Notes:
Responsibilities:
- Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n- Preserves caller-provided metadata.\n Constraints:
- The scraper does not perform parsing or interpretation.\n- Does not assume a specific storage backend.\n Initialize the PDF scraper.
Parameters:
client (BasePDFClient): PDF client responsible for retrieving raw PDF bytes. Required.
"},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper-functions","title":"Functions","text":""},{"location":"pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch a PDF document from the given source.
Parameters:
source (Any): Identifier of the PDF source as understood by the configured PDF client. Required.
metadata (Optional[Mapping[str, Any]]): Optional metadata to attach to the returned content. Default: None.
Returns:
Content: A Content instance containing raw PDF bytes, source identifier, PDF content type, and optional metadata.
Raises:
Exception: Retrieval-specific errors raised by the PDF client.
"}]}