{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"omniread/","title":"Omniread","text":""},{"location":"omniread/#omniread","title":"omniread","text":"
OmniRead \u2014 format-agnostic content acquisition and parsing framework.
OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.
The library is structured around three core concepts:
Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure: - Clear boundaries between IO and interpretation - Replaceable implementations per format - Predictable, testable behavior
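The separation described above can be illustrated with a small, self-contained sketch. It does not import OmniRead; SketchContent, InMemoryScraper, and WordCountParser are hypothetical stand-ins that only mirror the roles of Content, a scraper, and a parser.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SketchContent:
    # Hypothetical stand-in for Content: raw bytes plus their origin.
    raw: bytes
    source: str


class InMemoryScraper:
    # Acquisition only: returns bytes in a container and never interprets them.
    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(raw=self._store[source], source=source)


class WordCountParser:
    # Interpretation only: owns the content and turns it into a typed result.
    def __init__(self, content: SketchContent) -> None:
        self._content = content

    def parse(self) -> int:
        return len(self._content.raw.decode('utf-8').split())


scraper = InMemoryScraper({'memo': b'three short words'})
content = scraper.fetch('memo')
word_count = WordCountParser(content).parse()
```

Because the scraper and the parser only meet through the container, either side can be swapped or tested in isolation.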
"},{"location":"omniread/#omniread--installation","title":"Installation","text":"Install OmniRead using pip:
pip install omniread\n Or with Poetry:
poetry add omniread\n"},{"location":"omniread/#omniread--basic-usage","title":"Basic Usage","text":"HTML example:
from omniread import HTMLScraper, HTMLParser\n\nscraper = HTMLScraper()\ncontent = scraper.fetch(\"https://example.com\")\n\nclass TitleParser(HTMLParser[str]):\n    def parse(self) -> str:\n        return self._soup.title.string\n\nparser = TitleParser(content)\ntitle = parser.parse()\n PDF example:
from omniread import FileSystemPDFClient, PDFScraper, PDFParser\nfrom pathlib import Path\n\nclient = FileSystemPDFClient()\nscraper = PDFScraper(client=client)\ncontent = scraper.fetch(Path(\"document.pdf\"))\n\nclass TextPDFParser(PDFParser[str]):\n    def parse(self) -> str:\n        # implement PDF text extraction\n        ...\n\nparser = TextPDFParser(content)\nresult = parser.parse()\n"},{"location":"omniread/#omniread--public-api-surface","title":"Public API Surface","text":"This module re-exports the recommended public entry points of OmniRead.
Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.
Core: - Content - ContentType
HTML: - HTMLScraper - HTMLParser
PDF: - FileSystemPDFClient - PDFScraper - PDFParser
"},{"location":"omniread/#omniread--core-philosophy","title":"Core Philosophy","text":"OmniRead is designed as a decoupled content engine:
Content model, ensuring a consistent contract. For those extending OmniRead, follow these \"AI-Native\" docstring principles:
__init__.py..pyi stubs.: description pairs in the Raises section to help agents handle errors gracefully.dataclass","text":"Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n Normalized representation of extracted content.
A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.
This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers
Attributes:
raw (bytes): Raw content bytes as retrieved from the source.
source (str): Identifier of the content origin (URL, file path, or logical name).
content_type (Optional[ContentType]): Optional MIME type of the content, if known.
metadata (Optional[Mapping[str, Any]]): Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
"},{"location":"omniread/#omniread.ContentType","title":"ContentType","text":" Bases: str, Enum
Supported MIME types for extracted content.
This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
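As a rough illustration of routing by media type, the sketch below defines a local enum with the same MIME values as the documented members. SketchContentType, pick_handler, and the handler names are hypothetical and not part of OmniRead.

```python
from enum import Enum
from typing import Optional


class SketchContentType(str, Enum):
    # Values mirror the MIME strings documented for ContentType.
    HTML = 'text/html'
    JSON = 'application/json'
    PDF = 'application/pdf'
    XML = 'application/xml'


def pick_handler(content_type: Optional[SketchContentType]) -> str:
    # Hypothetical routing table; the handler names are illustrative only.
    routes = {
        SketchContentType.HTML: 'html-parser',
        SketchContentType.PDF: 'pdf-parser',
    }
    if content_type is None:
        return 'unknown'
    return routes.get(content_type, 'unsupported')


# Because the enum subclasses str, members compare equal to plain MIME strings.
matches_plain_string = SketchContentType.PDF == 'application/pdf'
handler = pick_handler(SketchContentType('text/html'))
```

Subclassing str keeps the members usable wherever a plain MIME string is expected, for example when comparing against an HTTP header value.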
"},{"location":"omniread/#omniread.ContentType.HTML","title":"HTMLclass-attribute instance-attribute","text":"HTML = 'text/html'\n HTML document content.
"},{"location":"omniread/#omniread.ContentType.JSON","title":"JSONclass-attribute instance-attribute","text":"JSON = 'application/json'\n JSON document content.
"},{"location":"omniread/#omniread.ContentType.PDF","title":"PDFclass-attribute instance-attribute","text":"PDF = 'application/pdf'\n PDF document content.
"},{"location":"omniread/#omniread.ContentType.XML","title":"XMLclass-attribute instance-attribute","text":"XML = 'application/xml'\n XML document content.
"},{"location":"omniread/#omniread.FileSystemPDFClient","title":"FileSystemPDFClient","text":" Bases: BasePDFClient
PDF client that reads from the local filesystem.
This client reads PDF files directly from the disk and returns their raw binary contents.
"},{"location":"omniread/#omniread.FileSystemPDFClient.fetch","title":"fetch","text":"fetch(path: Path) -> bytes\n Read a PDF file from the local filesystem.
Parameters:
path (Path, required): Filesystem path to the PDF file.
Returns:
bytes: Raw PDF bytes.
Raises:
FileNotFoundError: If the path does not exist.
ValueError: If the path exists but is not a file.
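The error contract above could be satisfied along the following lines. This is a sketch under assumptions, not the library's implementation: read_pdf_bytes is a hypothetical helper, and the error messages are invented for illustration.

```python
from pathlib import Path
import tempfile


def read_pdf_bytes(path: Path) -> bytes:
    # Hypothetical helper mirroring the documented error contract.
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    if not path.is_file():
        raise ValueError(f'Not a file: {path}')
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    sample = directory / 'sample.pdf'
    sample.write_bytes(b'%PDF-1.4 sketch')
    data = read_pdf_bytes(sample)

    # A directory exists but is not a file, so ValueError is expected.
    try:
        read_pdf_bytes(directory)
        directory_error = None
    except ValueError as exc:
        directory_error = type(exc).__name__

    # A missing path is expected to raise FileNotFoundError.
    try:
        read_pdf_bytes(directory / 'missing.pdf')
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = type(exc).__name__
```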
"},{"location":"omniread/#omniread.HTMLParser","title":"HTMLParser","text":"HTMLParser(content: Content, features: str = 'html.parser')\n Bases: BaseParser[T], Generic[T]
Base HTML parser.
This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.
Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.
Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures
Concrete subclasses must: - Define the output type T - Implement the parse() method
Initialize the HTML parser.
Parameters:
content (Content, required): HTML content to be parsed.
features (str, default 'html.parser'): BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').
Raises:
ValueError: If the content is empty or not valid HTML.
"},{"location":"omniread/#omniread.HTMLParser.supported_types","title":"supported_typesclass-attribute instance-attribute","text":"supported_types: set[ContentType] = {HTML}\n Set of content types supported by this parser (HTML only).
"},{"location":"omniread/#omniread.HTMLParser.parse","title":"parseabstractmethod","text":"parse() -> T\n Fully parse the HTML content into structured output.
Implementations must fully interpret the HTML DOM and return a deterministic, structured output.
Returns:
T: Parsed representation of type T.
staticmethod","text":"parse_div(div: Tag, *, separator: str = ' ') -> str\n Extract normalized text from a <div> element.
Parameters:
div (Tag, required): BeautifulSoup tag representing a <div>.
separator (str, default ' '): String used to separate text nodes.
Returns:
str: Flattened, whitespace-normalized text content.
"},{"location":"omniread/#omniread.HTMLParser.parse_link","title":"parse_linkstaticmethod","text":"parse_link(a: Tag) -> Optional[str]\n Extract the hyperlink reference from an <a> element.
Parameters:
a (Tag, required): BeautifulSoup tag representing an anchor.
Returns:
Optional[str]: The value of the href attribute, or None if absent.
parse_meta() -> dict[str, Any]\n Extract high-level metadata from the HTML document.
This includes: - Document title - <meta> tag name/property \u2192 content mappings
Returns:
dict[str, Any]: Dictionary containing extracted metadata.
"},{"location":"omniread/#omniread.HTMLParser.parse_table","title":"parse_tablestaticmethod","text":"parse_table(table: Tag) -> list[list[str]]\n Parse an HTML table into a 2D list of strings.
Parameters:
table (Tag, required): BeautifulSoup tag representing a <table>.
Returns:
list[list[str]]: A list of rows, where each row is a list of cell text values.
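To make the returned shape concrete, the sketch below builds the same kind of 2D list using only the standard library. It deliberately swaps BeautifulSoup for html.parser so that it runs without extra dependencies; TableRows is a hypothetical class and says nothing about how parse_table is actually implemented.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import List


class TableRows(StdlibHTMLParser):
    # Collects cell text row by row; assumes every cell sits inside a <tr>.
    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th'):
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._in_cell:
            # Collapse internal whitespace and trim the cell text.
            text = ' '.join(''.join(self._cell).split())
            self.rows[-1].append(text)
            self._in_cell = False


collector = TableRows()
collector.feed(
    '<table><tr><th>Name</th><th>Qty</th></tr>'
    '<tr><td> Apples </td><td>3</td></tr></table>'
)
table = collector.rows
```

Each inner list corresponds to one table row, with header cells and data cells treated alike.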
"},{"location":"omniread/#omniread.HTMLParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
"},{"location":"omniread/#omniread.HTMLScraper","title":"HTMLScraper","text":"HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n Bases: BaseScraper
Base HTML scraper using httpx.
This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.
Fetches raw bytes and metadata only. The scraper: - Uses httpx.Client for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata
The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses
Initialize the HTML scraper.
Parameters:
client (Optional[Client], default None): Optional pre-configured httpx.Client. If omitted, a client is created internally.
timeout (float, default 15.0): Request timeout in seconds.
headers (Optional[Mapping[str, str]], default None): Optional default HTTP headers.
follow_redirects (bool, default True): Whether to follow HTTP redirects."},{"location":"omniread/#omniread.HTMLScraper.fetch","title":"fetch","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch an HTML document from the given source.
Parameters:
source (str, required): URL of the HTML document.
metadata (Optional[Mapping[str, Any]], default None): Optional metadata to be merged into the returned content.
Returns:
Content: A Content instance.
Raises:
HTTPError: If the HTTP request fails.
ValueError: If the response is not valid HTML.
"},{"location":"omniread/#omniread.HTMLScraper.validate_content_type","title":"validate_content_type","text":"validate_content_type(response: httpx.Response) -> None\n Validate that the HTTP response contains HTML content.
Parameters:
response (Response, required): HTTP response returned by httpx.
Raises:
ValueError: If the Content-Type header is missing or does not indicate HTML content.
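A check of this kind might look like the sketch below. It works on a plain header mapping rather than an httpx.Response so that it stays dependency-free; ensure_html is a hypothetical function, and accepting only text/html (ignoring parameters such as charset) is an assumption about what counts as HTML.

```python
from typing import Mapping


def ensure_html(headers: Mapping[str, str]) -> None:
    # Hypothetical check; assumes header keys are already lower-cased.
    raw_value = headers.get('content-type', '')
    # Drop parameters such as '; charset=utf-8' before comparing.
    media_type = raw_value.split(';', 1)[0].strip().lower()
    if media_type != 'text/html':
        raise ValueError(f'Expected HTML, got: {raw_value!r}')


# A header with a charset parameter still counts as HTML and passes silently.
ensure_html({'content-type': 'text/html; charset=utf-8'})

try:
    ensure_html({'content-type': 'application/json'})
    json_rejected = False
except ValueError:
    json_rejected = True

try:
    ensure_html({})
    missing_rejected = False
except ValueError:
    missing_rejected = True
```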
PDFParser(content: Content)\n Bases: BaseParser[T], Generic[T]
Base PDF parser.
This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.
Concrete implementations must: - Define the output type T - Implement the parse() method
Initialize the parser with content to be parsed.
Parameters:
content (Content, required): Content instance to be parsed.
Raises:
ValueError: If the content type is not supported by this parser.
"},{"location":"omniread/#omniread.PDFParser.supported_types","title":"supported_typesclass-attribute instance-attribute","text":"supported_types: set[ContentType] = {PDF}\n Set of content types supported by this parser (PDF only).
"},{"location":"omniread/#omniread.PDFParser.parse","title":"parseabstractmethod","text":"parse() -> T\n Parse PDF content into a structured output.
Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.
Returns:
T: Parsed representation of type T.
Raises:
Exception: Parsing-specific errors as defined by the implementation.
"},{"location":"omniread/#omniread.PDFParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
"},{"location":"omniread/#omniread.PDFScraper","title":"PDFScraper","text":"PDFScraper(*, client: BasePDFClient)\n Bases: BaseScraper
Scraper for PDF sources.
Delegates byte retrieval to a PDF client and normalizes output into Content.
The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata
Initialize the PDF scraper.
Parameters:
client (BasePDFClient, required): PDF client responsible for retrieving raw PDF bytes."},{"location":"omniread/#omniread.PDFScraper.fetch","title":"fetch","text":"fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch a PDF document from the given source.
Parameters:
source (Any, required): Identifier of the PDF source as understood by the configured PDF client.
metadata (Optional[Mapping[str, Any]], default None): Optional metadata to attach to the returned content.
Returns:
Content: A Content instance.
Raises:
Exception: Retrieval-specific errors raised by the PDF client.
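The delegation pattern described for PDFScraper can be sketched without OmniRead as follows. BytesClient, DictClient, DelegatingScraper, and SketchContent are hypothetical names, and the metadata key is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional, Protocol


class BytesClient(Protocol):
    # Hypothetical client contract: turn a source identifier into raw bytes.
    def fetch(self, source: Any) -> bytes: ...


@dataclass(frozen=True)
class SketchContent:
    raw: bytes
    source: str
    content_type: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class DictClient:
    # Toy storage backend; a real client might read from disk or object storage.
    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self._blobs = blobs

    def fetch(self, source: Any) -> bytes:
        return self._blobs[str(source)]


class DelegatingScraper:
    def __init__(self, *, client: BytesClient) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> SketchContent:
        # The scraper never inspects the bytes; it only wraps them.
        raw = self._client.fetch(source)
        return SketchContent(
            raw=raw,
            source=str(source),
            content_type='application/pdf',
            metadata=dict(metadata or {}),
        )


scraper = DelegatingScraper(client=DictClient({'report.pdf': b'%PDF-1.7'}))
content = scraper.fetch('report.pdf', metadata={'requested_by': 'docs-sketch'})
```

Keeping the storage backend behind a small client interface is what lets the scraper stay unaware of where the bytes come from.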
"},{"location":"omniread/core/","title":"Core","text":""},{"location":"omniread/core/#omniread.core","title":"omniread.core","text":"Core domain contracts for OmniRead.
This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).
Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.
Submodules: - content: Canonical content models and enums - parser: Abstract parsing contracts - scraper: Abstract scraping contracts
Format-specific behavior must not be introduced at this layer.
"},{"location":"omniread/core/#omniread.core.BaseParser","title":"BaseParser","text":"BaseParser(content: Content)\n Bases: ABC, Generic[T]
Base interface for all parsers.
A parser is a self-contained object that owns the Content it is responsible for interpreting.
Implementations must: - Declare supported content types via supported_types - Raise parsing-specific exceptions from parse() - Remain deterministic for a given input
Consumers may rely on: - Early validation of content compatibility - Type-stable return values from parse()
Initialize the parser with content to be parsed.
Parameters:
content (Content, required): Content instance to be parsed.
Raises:
ValueError: If the content type is not supported by this parser.
"},{"location":"omniread/core/#omniread.core.BaseParser.supported_types","title":"supported_typesclass-attribute instance-attribute","text":"supported_types: Set[ContentType] = set()\n Set of content types supported by this parser.
An empty set indicates that the parser is content-type agnostic.
"},{"location":"omniread/core/#omniread.core.BaseParser.parse","title":"parseabstractmethod","text":"parse() -> T\n Parse the owned content into structured output.
Implementations must fully consume the provided content and return a deterministic, structured output.
Returns:
T: Parsed, structured representation.
Raises:
Exception: Parsing-specific errors as defined by the implementation.
"},{"location":"omniread/core/#omniread.core.BaseParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
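One plausible reading of the supported_types rule is sketched below. The function is hypothetical and works on plain MIME strings; treating an unknown (None) content type as unsupported for type-restricted parsers is an assumption, since the text above does not say how that case is handled.

```python
from typing import Optional, Set


def supports(supported_types: Set[str], content_type: Optional[str]) -> bool:
    # Documented rule: an empty set means the parser is content-type agnostic.
    if not supported_types:
        return True
    # Assumption: an unknown (None) type is rejected by restricted parsers.
    return content_type in supported_types


agnostic_accepts_unknown = supports(set(), None)
html_only_accepts_html = supports({'text/html'}, 'text/html')
html_only_rejects_pdf = supports({'text/html'}, 'application/pdf')
```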
"},{"location":"omniread/core/#omniread.core.BaseScraper","title":"BaseScraper","text":" Bases: ABC
Base interface for all scrapers.
A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it.
A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a Content object.
Scrapers define how content is obtained, not what the content means.
Implementations may vary in: - Transport mechanism (HTTP, filesystem, cloud storage) - Authentication strategy - Retry and backoff behavior
Implementations must not: - Parse content - Modify content semantics - Couple scraping logic to a specific parser
"},{"location":"omniread/core/#omniread.core.BaseScraper.fetch","title":"fetchabstractmethod","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch raw content from the given source.
Implementations must retrieve the content referenced by source and return it as raw bytes wrapped in a Content object.
Parameters:
source (str, required): Location identifier (URL, file path, S3 URI, etc.).
metadata (Optional[Mapping[str, Any]], default None): Optional hints for the scraper (headers, auth, etc.).
Returns:
Content: Content object containing raw bytes and metadata.
Raises:
Exception: Retrieval-specific errors as defined by the implementation.
"},{"location":"omniread/core/#omniread.core.Content","title":"Contentdataclass","text":"Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n Normalized representation of extracted content.
A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.
This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers
Attributes:
raw (bytes): Raw content bytes as retrieved from the source.
source (str): Identifier of the content origin (URL, file path, or logical name).
content_type (Optional[ContentType]): Optional MIME type of the content, if known.
metadata (Optional[Mapping[str, Any]]): Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
"},{"location":"omniread/core/#omniread.core.ContentType","title":"ContentType","text":" Bases: str, Enum
Supported MIME types for extracted content.
This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
"},{"location":"omniread/core/#omniread.core.ContentType.HTML","title":"HTMLclass-attribute instance-attribute","text":"HTML = 'text/html'\n HTML document content.
"},{"location":"omniread/core/#omniread.core.ContentType.JSON","title":"JSONclass-attribute instance-attribute","text":"JSON = 'application/json'\n JSON document content.
"},{"location":"omniread/core/#omniread.core.ContentType.PDF","title":"PDFclass-attribute instance-attribute","text":"PDF = 'application/pdf'\n PDF document content.
"},{"location":"omniread/core/#omniread.core.ContentType.XML","title":"XMLclass-attribute instance-attribute","text":"XML = 'application/xml'\n XML document content.
"},{"location":"omniread/core/content/","title":"Content","text":""},{"location":"omniread/core/content/#omniread.core.content","title":"omniread.core.content","text":"Canonical content models for OmniRead.
This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.
The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.
"},{"location":"omniread/core/content/#omniread.core.content.Content","title":"Contentdataclass","text":"Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)\n Normalized representation of extracted content.
A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.
This class is the primary exchange format between: - Scrapers - Parsers - Downstream consumers
Attributes:
raw (bytes): Raw content bytes as retrieved from the source.
source (str): Identifier of the content origin (URL, file path, or logical name).
content_type (Optional[ContentType]): Optional MIME type of the content, if known.
metadata (Optional[Mapping[str, Any]]): Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
"},{"location":"omniread/core/content/#omniread.core.content.ContentType","title":"ContentType","text":" Bases: str, Enum
Supported MIME types for extracted content.
This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
"},{"location":"omniread/core/content/#omniread.core.content.ContentType.HTML","title":"HTMLclass-attribute instance-attribute","text":"HTML = 'text/html'\n HTML document content.
"},{"location":"omniread/core/content/#omniread.core.content.ContentType.JSON","title":"JSONclass-attribute instance-attribute","text":"JSON = 'application/json'\n JSON document content.
"},{"location":"omniread/core/content/#omniread.core.content.ContentType.PDF","title":"PDFclass-attribute instance-attribute","text":"PDF = 'application/pdf'\n PDF document content.
"},{"location":"omniread/core/content/#omniread.core.content.ContentType.XML","title":"XMLclass-attribute instance-attribute","text":"XML = 'application/xml'\n XML document content.
"},{"location":"omniread/core/parser/","title":"Parser","text":""},{"location":"omniread/core/parser/#omniread.core.parser","title":"omniread.core.parser","text":"Abstract parsing contracts for OmniRead.
This module defines the format-agnostic parser interface used to transform raw content into structured, typed representations.
Parsers are responsible for: - Interpreting a single Content instance - Validating compatibility with the content type - Producing a structured output suitable for downstream consumers
Parsers are not responsible for: - Fetching or acquiring content - Performing retries or error recovery - Managing multiple content sources
"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser","title":"BaseParser","text":"BaseParser(content: Content)\n Bases: ABC, Generic[T]
Base interface for all parsers.
A parser is a self-contained object that owns the Content it is responsible for interpreting.
Implementations must: - Declare supported content types via supported_types - Raise parsing-specific exceptions from parse() - Remain deterministic for a given input
Consumers may rely on: - Early validation of content compatibility - Type-stable return values from parse()
Initialize the parser with content to be parsed.
Parameters:
content (Content, required): Content instance to be parsed.
Raises:
ValueError: If the content type is not supported by this parser.
"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supported_types","title":"supported_typesclass-attribute instance-attribute","text":"supported_types: Set[ContentType] = set()\n Set of content types supported by this parser.
An empty set indicates that the parser is content-type agnostic.
"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.parse","title":"parseabstractmethod","text":"parse() -> T\n Parse the owned content into structured output.
Implementations must fully consume the provided content and return a deterministic, structured output.
Returns:
T: Parsed, structured representation.
Raises:
Exception: Parsing-specific errors as defined by the implementation.
"},{"location":"omniread/core/parser/#omniread.core.parser.BaseParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
"},{"location":"omniread/core/scraper/","title":"Scraper","text":""},{"location":"omniread/core/scraper/#omniread.core.scraper","title":"omniread.core.scraper","text":"Abstract scraping contracts for OmniRead.
This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.
Scrapers are responsible for: - Locating and retrieving raw content bytes - Attaching minimal contextual metadata - Returning normalized Content objects
Scrapers are explicitly NOT responsible for: - Parsing or interpreting content - Inferring structure or semantics - Performing content-type specific processing
All interpretation must be delegated to parsers.
"},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper","title":"BaseScraper","text":" Bases: ABC
Base interface for all scrapers.
A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it.
A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a Content object.
Scrapers define how content is obtained, not what the content means.
Implementations may vary in: - Transport mechanism (HTTP, filesystem, cloud storage) - Authentication strategy - Retry and backoff behavior
Implementations must not: - Parse content - Modify content semantics - Couple scraping logic to a specific parser
"},{"location":"omniread/core/scraper/#omniread.core.scraper.BaseScraper.fetch","title":"fetchabstractmethod","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch raw content from the given source.
Implementations must retrieve the content referenced by source and return it as raw bytes wrapped in a Content object.
Parameters:
source (str, required): Location identifier (URL, file path, S3 URI, etc.).
metadata (Optional[Mapping[str, Any]], default None): Optional hints for the scraper (headers, auth, etc.).
Returns:
Content: Content object containing raw bytes and metadata.
Raises:
Exception: Retrieval-specific errors as defined by the implementation.
"},{"location":"omniread/html/","title":"Html","text":""},{"location":"omniread/html/#omniread.html","title":"omniread.html","text":"HTML format implementation for OmniRead.
This package provides HTML-specific implementations of the core OmniRead contracts defined in omniread.core.
It includes: - HTML parsers that interpret HTML content - HTML scrapers that retrieve HTML documents
This package: - Implements, but does not redefine, core contracts - May contain HTML-specific behavior and edge-case handling - Produces canonical content models defined in omniread.core.content
Consumers should depend on omniread.core interfaces wherever possible and use this package only when HTML-specific behavior is required.
HTMLParser(content: Content, features: str = 'html.parser')\n Bases: BaseParser[T], Generic[T]
Base HTML parser.
This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.
Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.
Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures
Concrete subclasses must: - Define the output type T - Implement the parse() method
Initialize the HTML parser.
Parameters:
content (Content, required): HTML content to be parsed.
features (str, default 'html.parser'): BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').
Raises:
ValueError: If the content is empty or not valid HTML.
"},{"location":"omniread/html/#omniread.html.HTMLParser.supported_types","title":"supported_typesclass-attribute instance-attribute","text":"supported_types: set[ContentType] = {HTML}\n Set of content types supported by this parser (HTML only).
"},{"location":"omniread/html/#omniread.html.HTMLParser.parse","title":"parseabstractmethod","text":"parse() -> T\n Fully parse the HTML content into structured output.
Implementations must fully interpret the HTML DOM and return a deterministic, structured output.
Returns:
T: Parsed representation of type T.
staticmethod","text":"parse_div(div: Tag, *, separator: str = ' ') -> str\n Extract normalized text from a <div> element.
Parameters:
div (Tag, required): BeautifulSoup tag representing a <div>.
separator (str, default ' '): String used to separate text nodes.
Returns:
str: Flattened, whitespace-normalized text content.
"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_link","title":"parse_linkstaticmethod","text":"parse_link(a: Tag) -> Optional[str]\n Extract the hyperlink reference from an <a> element.
Parameters:
a (Tag, required): BeautifulSoup tag representing an anchor.
Returns:
Optional[str]: The value of the href attribute, or None if absent.
parse_meta() -> dict[str, Any]\n Extract high-level metadata from the HTML document.
This includes: - Document title - <meta> tag name/property \u2192 content mappings
Returns:
dict[str, Any]: Dictionary containing extracted metadata.
"},{"location":"omniread/html/#omniread.html.HTMLParser.parse_table","title":"parse_tablestaticmethod","text":"parse_table(table: Tag) -> list[list[str]]\n Parse an HTML table into a 2D list of strings.
Parameters:
table (Tag, required): BeautifulSoup tag representing a <table>.
Returns:
list[list[str]]: A list of rows, where each row is a list of cell text values.
"},{"location":"omniread/html/#omniread.html.HTMLParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
bool: True if the content type is supported; False otherwise.
"},{"location":"omniread/html/#omniread.html.HTMLScraper","title":"HTMLScraper","text":"HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n Bases: BaseScraper
Base HTML scraper using httpx.
This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.
Fetches raw bytes and metadata only. The scraper: - Uses httpx.Client for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata
The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses
Initialize the HTML scraper.
Parameters:
client (Optional[Client], default None): Optional pre-configured httpx.Client. If omitted, a client is created internally.
timeout (float, default 15.0): Request timeout in seconds.
headers (Optional[Mapping[str, str]], default None): Optional default HTTP headers.
follow_redirects (bool, default True): Whether to follow HTTP redirects."},{"location":"omniread/html/#omniread.html.HTMLScraper.fetch","title":"fetch","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch an HTML document from the given source.
Parameters:
source (str, required): URL of the HTML document.
metadata (Optional[Mapping[str, Any]], default None): Optional metadata to be merged into the returned content.
Returns:
Content: A Content instance.
Raises:
HTTPError: If the HTTP request fails.
ValueError: If the response is not valid HTML.
"},{"location":"omniread/html/#omniread.html.HTMLScraper.validate_content_type","title":"validate_content_type","text":"validate_content_type(response: httpx.Response) -> None\n Validate that the HTTP response contains HTML content.
Parameters:
response (Response, required): HTTP response returned by httpx.
Raises:
ValueError: If the Content-Type header is missing or does not indicate HTML content.
HTML parser base implementations for OmniRead.
This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.
It supplies: - Content-type enforcement for HTML inputs - BeautifulSoup initialization and lifecycle management - Common helper methods for extracting structured data from HTML elements
Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.
HTMLParser(content: Content, features: str = 'html.parser')\n Bases: BaseParser[T], Generic[T]
Base HTML parser.
This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.
Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.
Characteristics: - Accepts only HTML content - Owns a parsed BeautifulSoup DOM tree - Provides pure helper utilities for common HTML structures
Concrete subclasses must: - Define the output type T - Implement the parse() method
Initialize the HTML parser.
Parameters:
Name | Type | Description | Default
content | Content | HTML content to be parsed. | required
features | str | BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). | 'html.parser'
Raises:
Type | Description
ValueError | If the content is empty or not valid HTML.
"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supported_types","title":"supported_typesclass-attribute instance-attribute","text":"supported_types: set[ContentType] = {HTML}\n Set of content types supported by this parser (HTML only).
"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse","title":"parseabstractmethod","text":"parse() -> T\n Fully parse the HTML content into structured output.
Implementations must fully interpret the HTML DOM and return a deterministic, structured output.
Returns:
Type | Description
T | Parsed representation of type T.
staticmethod","text":"parse_div(div: Tag, *, separator: str = ' ') -> str\n Extract normalized text from a <div> element.
Parameters:
Name | Type | Description | Default
div | Tag | BeautifulSoup tag representing a <div>. | required
separator | str | String used to separate text nodes. | ' '
Returns:
Type | Description
str | Flattened, whitespace-normalized text content.
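One way to read the phrase flattened, whitespace-normalized text is sketched below. It is an illustration only and does not reproduce parse_div: the helper name flatten_text is hypothetical, and the text nodes are assumed to arrive as a plain iterable of strings (with BeautifulSoup, Tag.get_text with a separator and strip=True gives a comparable result).

```python
from typing import Iterable


def flatten_text(fragments: Iterable[str], separator: str = ' ') -> str:
    # Collapse runs of whitespace inside each fragment.
    cleaned = (' '.join(fragment.split()) for fragment in fragments)
    # Drop fragments that were empty or whitespace-only, then join the rest.
    return separator.join(piece for piece in cleaned if piece)


print(flatten_text(['  Hello ', '', 'wide   world']))  # prints: Hello wide world
```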
"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_link","title":"parse_linkstaticmethod","text":"parse_link(a: Tag) -> Optional[str]\n Extract the hyperlink reference from an <a> element.
Parameters:
Name | Type | Description | Default
a | Tag | BeautifulSoup tag representing an anchor. | required
Returns:
Type | Description
Optional[str] | The value of the href attribute, or None if absent.
parse_meta() -> dict[str, Any]\n Extract high-level metadata from the HTML document.
This includes: - Document title - <meta> tag name/property \u2192 content mappings
Returns:
Type | Description
dict[str, Any] | Dictionary containing extracted metadata.
"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.parse_table","title":"parse_tablestaticmethod","text":"parse_table(table: Tag) -> list[list[str]]\n Parse an HTML table into a 2D list of strings.
Parameters:
Name | Type | Description | Default
table | Tag | BeautifulSoup tag representing a <table>. | required
Returns:
Type | Description
list[list[str]] | A list of rows, where each row is a list of cell text values.
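The rows-of-cells shape can be illustrated without BeautifulSoup. The sketch below uses the standard library's html.parser instead of a Tag, so it is not how parse_table works internally; the names TableRows and table_to_rows are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import List, Optional


class TableRows(StdlibHTMLParser):
    # Collects the text of each td/th cell, grouped by tr row.
    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th') and self.rows:
            self._cell = []

    def handle_data(self, data):
        # Only text that appears inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            text = ' '.join(''.join(self._cell).split())
            self.rows[-1].append(text)
            self._cell = None


def table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Each inner list corresponds to one row, with header and data cells treated alike, which matches the documented return type of a 2D list of strings.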
"},{"location":"omniread/html/parser/#omniread.html.parser.HTMLParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
Type | Description
bool | True if the content type is supported; False otherwise.
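The relationship between supported_types and supports() can be sketched as a set membership test. Everything here is a stand-in: the enum values, the SketchParser classes, and the constructor that takes a content type directly are assumptions, since the real parsers receive a Content object.

```python
from enum import Enum


class ContentType(Enum):
    # Stand-in enum; the member values are assumptions for this sketch.
    HTML = 'text/html'
    PDF = 'application/pdf'


class SketchParser:
    # Subclasses narrow this set to the types they can interpret.
    supported_types: set = set()

    def __init__(self, content_type: ContentType) -> None:
        self.content_type = content_type

    def supports(self) -> bool:
        # Supported means the content's type is a member of supported_types.
        return self.content_type in self.supported_types


class SketchHTMLParser(SketchParser):
    supported_types = {ContentType.HTML}


print(SketchHTMLParser(ContentType.HTML).supports())  # prints: True
print(SketchHTMLParser(ContentType.PDF).supports())   # prints: False
```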
"},{"location":"omniread/html/scraper/","title":"Scraper","text":""},{"location":"omniread/html/scraper/#omniread.html.scraper","title":"omniread.html.scraper","text":"HTML scraping implementation for OmniRead.
This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.
This scraper is responsible for: - Fetching raw HTML bytes over HTTP(S) - Validating response content type - Attaching HTTP metadata to the returned content
This scraper is not responsible for: - Parsing or interpreting HTML - Retrying failed requests - Managing crawl policies or rate limiting
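Because retries are left to the caller, a thin wrapper around any fetch callable is one way to add them. This is a sketch under stated assumptions: fetch_with_retries is a hypothetical helper, not part of OmniRead, and it retries on every exception, where real code would likely narrow the except clause to transport errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


def fetch_with_retries(
    fetch: Callable[[str], T],
    source: str,
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(attempts):
        try:
            return fetch(source)
        except Exception:
            # Re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError('unreachable')
```

With an HTMLScraper instance named scraper, a call could look like fetch_with_retries(scraper.fetch, 'https://example.com'), which keeps the retry policy outside the scraper.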
"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper","title":"HTMLScraper","text":"HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)\n Bases: BaseScraper
Base HTML scraper using httpx.
This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.
Fetches raw bytes and metadata only. The scraper: - Uses httpx.Client for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata
The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses
Initialize the HTML scraper.
Parameters:
Name | Type | Description | Default
client | Optional[Client] | Optional pre-configured httpx.Client. If omitted, a client is created internally. | None
timeout | float | Request timeout in seconds. | 15.0
headers | Optional[Mapping[str, str]] | Optional default HTTP headers. | None
follow_redirects | bool | Whether to follow HTTP redirects. | True"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.fetch","title":"fetch","text":"fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch an HTML document from the given source.
Parameters:
Name | Type | Description | Default
source | str | URL of the HTML document. | required
metadata | Optional[Mapping[str, Any]] | Optional metadata to be merged into the returned content. | None
Returns:
Type | Description
Content | A Content instance containing the raw HTML bytes and the associated metadata.
Raises:
Type | Description
HTTPError | If the HTTP request fails.
ValueError | If the response is not valid HTML.
"},{"location":"omniread/html/scraper/#omniread.html.scraper.HTMLScraper.validate_content_type","title":"validate_content_type","text":"validate_content_type(response: httpx.Response) -> None\n Validate that the HTTP response contains HTML content.
Parameters:
Name | Type | Description | Default
response | Response | HTTP response returned by httpx. | required
Raises:
Type | Description
ValueError | If the Content-Type header is missing or does not indicate HTML content.
PDF format implementation for OmniRead.
This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.
Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes: - PDF clients for acquiring raw PDF data - PDF scrapers that coordinate client access - PDF parsers that extract structured content from PDF binaries
Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.
"},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient","title":"FileSystemPDFClient","text":" Bases: BasePDFClient
PDF client that reads from the local filesystem.
This client reads PDF files directly from the disk and returns their raw binary contents.
"},{"location":"omniread/pdf/#omniread.pdf.FileSystemPDFClient.fetch","title":"fetch","text":"fetch(path: Path) -> bytes\n Read a PDF file from the local filesystem.
Parameters:
Name | Type | Description | Default
path | Path | Filesystem path to the PDF file. | required
Returns:
Type | Description
bytes | Raw PDF bytes.
Raises:
Type | Description
FileNotFoundError | If the path does not exist.
ValueError | If the path exists but is not a file.
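The documented behavior (bytes on success, FileNotFoundError for a missing path, ValueError for a non-file) can be sketched with pathlib. The function name read_pdf_bytes is hypothetical and the body is an illustration of the contract, not the client's actual source.

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    # A missing path is reported separately from a path that is not a file.
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    if not path.is_file():
        raise ValueError(f'Not a file: {path}')
    # No validation of the payload: the bytes are returned as-is.
    return path.read_bytes()
```

Checking existence first means a directory path raises ValueError rather than FileNotFoundError, matching the two documented error cases.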
"},{"location":"omniread/pdf/#omniread.pdf.PDFParser","title":"PDFParser","text":"PDFParser(content: Content)\n Bases: BaseParser[T], Generic[T]
Base PDF parser.
This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.
Concrete implementations must: - Define the output type T - Implement the parse() method
Initialize the parser with content to be parsed.
Parameters:
Name | Type | Description | Default
content | Content | Content instance to be parsed. | required
Raises:
Type | Description
ValueError | If the content type is not supported by this parser.
"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supported_types","title":"supported_typesclass-attribute instance-attribute","text":"supported_types: set[ContentType] = {PDF}\n Set of content types supported by this parser (PDF only).
"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.parse","title":"parseabstractmethod","text":"parse() -> T\n Parse PDF content into a structured output.
Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.
Returns:
Type | Description
T | Parsed representation of type T.
Raises:
Type | Description
Exception | Parsing-specific errors as defined by the implementation.
"},{"location":"omniread/pdf/#omniread.pdf.PDFParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
Type | Description
bool | True if the content type is supported; False otherwise.
"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper","title":"PDFScraper","text":"PDFScraper(*, client: BasePDFClient)\n Bases: BaseScraper
Scraper for PDF sources.
Delegates byte retrieval to a PDF client and normalizes output into Content.
The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata
Initialize the PDF scraper.
Parameters:
Name | Type | Description | Default
client | BasePDFClient | PDF client responsible for retrieving raw PDF bytes. | required"},{"location":"omniread/pdf/#omniread.pdf.PDFScraper.fetch","title":"fetch","text":"fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch a PDF document from the given source.
Parameters:
Name | Type | Description | Default
source | Any | Identifier of the PDF source as understood by the configured PDF client. | required
metadata | Optional[Mapping[str, Any]] | Optional metadata to attach to the returned content. | None
Returns:
Type | Description
Content | A Content instance containing the raw PDF bytes and any caller-provided metadata.
Raises:
Type | Description
Exception | Retrieval-specific errors raised by the PDF client.
"},{"location":"omniread/pdf/client/","title":"Client","text":""},{"location":"omniread/pdf/client/#omniread.pdf.client","title":"omniread.pdf.client","text":"PDF client abstractions for OmniRead.
This module defines the client layer responsible for retrieving raw PDF bytes from a concrete backing store.
Clients provide low-level access to PDF binaries and are intentionally decoupled from scraping and parsing logic. They do not perform validation, interpretation, or content extraction.
Typical backing stores include: - Local filesystems - Object storage (S3, GCS, etc.) - Network file systems
"},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient","title":"BasePDFClient","text":" Bases: ABC
Abstract client responsible for retrieving PDF bytes from a specific backing store (filesystem, S3, FTP, etc.).
Implementations must: - Accept a source identifier appropriate to the backing store - Return the full PDF binary payload - Raise retrieval-specific errors on failure
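The three requirements above can be illustrated with a toy backing store. This is a sketch, not library code: SketchPDFClient is a local stand-in that mirrors the documented fetch(source) -> bytes contract, and InMemoryPDFClient with its dictionary of documents is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchPDFClient(ABC):
    # Local stand-in for the abstract client contract.
    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        ...


class InMemoryPDFClient(SketchPDFClient):
    # Hypothetical backing store: a mapping from key to PDF bytes.
    def __init__(self, documents: Dict[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(self, source: Any) -> bytes:
        # Return the full payload, or raise a retrieval-specific error.
        try:
            return self._documents[source]
        except KeyError:
            raise LookupError(f'No PDF stored under {source!r}') from None


client = InMemoryPDFClient({'report': b'%PDF-1.7 sketch'})
print(client.fetch('report'))  # prints: b'%PDF-1.7 sketch'
```

A client for object storage or a network share would follow the same shape, with only the body of fetch changing.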
"},{"location":"omniread/pdf/client/#omniread.pdf.client.BasePDFClient.fetch","title":"fetchabstractmethod","text":"fetch(source: Any) -> bytes\n Fetch raw PDF bytes from the given source.
Parameters:
Name | Type | Description | Default
source | Any | Identifier of the PDF location, such as a file path, object storage key, or remote reference. | required
Returns:
Type | Description
bytes | Raw PDF bytes.
Raises:
Type | Description
Exception | Retrieval-specific errors defined by the implementation.
"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient","title":"FileSystemPDFClient","text":" Bases: BasePDFClient
PDF client that reads from the local filesystem.
This client reads PDF files directly from the disk and returns their raw binary contents.
"},{"location":"omniread/pdf/client/#omniread.pdf.client.FileSystemPDFClient.fetch","title":"fetch","text":"fetch(path: Path) -> bytes\n Read a PDF file from the local filesystem.
Parameters:
Name | Type | Description | Default
path | Path | Filesystem path to the PDF file. | required
Returns:
Type | Description
bytes | Raw PDF bytes.
Raises:
Type | Description
FileNotFoundError | If the path does not exist.
ValueError | If the path exists but is not a file.
"},{"location":"omniread/pdf/parser/","title":"Parser","text":""},{"location":"omniread/pdf/parser/#omniread.pdf.parser","title":"omniread.pdf.parser","text":"PDF parser base implementations for OmniRead.
This module defines the PDF-specific parser contract, extending the format-agnostic BaseParser with constraints appropriate for PDF content.
PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.
"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser","title":"PDFParser","text":"PDFParser(content: Content)\n Bases: BaseParser[T], Generic[T]
Base PDF parser.
This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.
Concrete implementations must: - Define the output type T - Implement the parse() method
Initialize the parser with content to be parsed.
Parameters:
Name | Type | Description | Default
content | Content | Content instance to be parsed. | required
Raises:
Type | Description
ValueError | If the content type is not supported by this parser.
"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supported_types","title":"supported_typesclass-attribute instance-attribute","text":"supported_types: set[ContentType] = {PDF}\n Set of content types supported by this parser (PDF only).
"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.parse","title":"parseabstractmethod","text":"parse() -> T\n Parse PDF content into a structured output.
Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.
Returns:
Type | Description
T | Parsed representation of type T.
Raises:
Type | Description
Exception | Parsing-specific errors as defined by the implementation.
"},{"location":"omniread/pdf/parser/#omniread.pdf.parser.PDFParser.supports","title":"supports","text":"supports() -> bool\n Check whether this parser supports the content's type.
Returns:
Type | Description
bool | True if the content type is supported; False otherwise.
"},{"location":"omniread/pdf/scraper/","title":"Scraper","text":""},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper","title":"omniread.pdf.scraper","text":"PDF scraping implementation for OmniRead.
This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.
The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.
PDFScraper(*, client: BasePDFClient)\n Bases: BaseScraper
Scraper for PDF sources.
Delegates byte retrieval to a PDF client and normalizes output into Content.
The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata
Initialize the PDF scraper.
Parameters:
Name | Type | Description | Default
client | BasePDFClient | PDF client responsible for retrieving raw PDF bytes. | required"},{"location":"omniread/pdf/scraper/#omniread.pdf.scraper.PDFScraper.fetch","title":"fetch","text":"fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content\n Fetch a PDF document from the given source.
Parameters:
Name | Type | Description | Default
source | Any | Identifier of the PDF source as understood by the configured PDF client. | required
metadata | Optional[Mapping[str, Any]] | Optional metadata to attach to the returned content. | None
Returns:
Type | Description
Content | A Content instance containing the raw PDF bytes and any caller-provided metadata.
Raises:
Type | Description
Exception | Retrieval-specific errors raised by the PDF client.
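The delegation described for this scraper can be sketched as follows. All names are stand-ins: SketchContent and its fields (raw, source, metadata) are assumptions about what a content container might hold, and SketchPDFScraper accepts a plain callable where the real scraper takes a BasePDFClient.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Mapping, Optional


@dataclass(frozen=True)
class SketchContent:
    # Hypothetical stand-in for Content; the field names are assumptions.
    raw: bytes
    source: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchPDFScraper:
    def __init__(self, *, fetch_bytes: Callable[[Any], bytes]) -> None:
        # The callable plays the role of the client's fetch method.
        self._fetch_bytes = fetch_bytes

    def fetch(
        self,
        source: Any,
        *,
        metadata: Optional[Mapping[str, Any]] = None,
    ) -> SketchContent:
        # Retrieval is delegated; errors from the callable propagate unchanged.
        raw = self._fetch_bytes(source)
        # Caller-provided metadata is copied through without interpretation.
        return SketchContent(raw=raw, source=str(source), metadata=dict(metadata or {}))
```

Keeping storage access behind the injected callable is what lets the same scraper work with a filesystem, object storage, or a test double.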
"}]}