Compare commits

1 Commits

Author SHA1 Message Date
de7d04eb1a updated docs strings and added README.md 2026-03-08 17:59:56 +05:30
26 changed files with 546 additions and 406 deletions

README.md (new file, 108 additions)
@@ -0,0 +1,108 @@
# omniread
# Summary
`OmniRead` is a format-agnostic content acquisition and parsing framework.
`OmniRead` provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.
The library is structured around three core concepts:
1. **`Content`**: A canonical, format-agnostic container representing raw content
bytes and minimal contextual metadata.
2. **`Scrapers`**: Components responsible for *acquiring* raw content from a
source (HTTP, filesystem, object storage, etc.). `Scrapers` never interpret
content.
3. **`Parsers`**: Components responsible for *interpreting* acquired content and
converting it into structured, typed representations.
`OmniRead` deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation.
- Replaceable implementations per format.
- Predictable, testable behavior.
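The separation described above can be illustrated with a minimal, self-contained sketch. It does not import `omniread`; every name in it (`SketchContent`, `SketchScraper`, `SketchParser`, `InMemoryScraper`, `WordCountParser`, and the `source` and `content_type` fields) is a hypothetical stand-in chosen for illustration, not the library's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for `Content`: raw bytes plus minimal metadata."""

    raw: bytes
    source: str        # assumed field name
    content_type: str  # assumed field name


class SketchScraper(ABC):
    """Acquires raw bytes and never interprets them."""

    @abstractmethod
    def fetch(self, source: str) -> SketchContent: ...


class SketchParser(ABC, Generic[T]):
    """Owns one content object and turns it into a typed result."""

    def __init__(self, content: SketchContent) -> None:
        self.content = content

    @abstractmethod
    def parse(self) -> T: ...


class InMemoryScraper(SketchScraper):
    """Toy scraper that 'fetches' from a dict instead of the network."""

    def __init__(self, documents: dict[str, bytes]) -> None:
        self._documents = documents

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(
            raw=self._documents[source],
            source=source,
            content_type="text/plain",
        )


class WordCountParser(SketchParser[int]):
    """Toy parser: the only place where the bytes are decoded."""

    def parse(self) -> int:
        return len(self.content.raw.decode("utf-8").split())


scraper = InMemoryScraper({"memory://greeting": b"hello format agnostic world"})
content = scraper.fetch("memory://greeting")
word_count = WordCountParser(content).parse()
```

Because the scraper and the parser only share the content container, either side can be swapped (for example, a different transport or a different output type) without touching the other.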
# Installation
Install `OmniRead` using pip:
```bash
pip install omniread
```
Install `OmniRead` using Poetry:
```bash
poetry add omniread
```
---
# Quick Start
HTML example:
```python
from omniread import HTMLParser, HTMLScraper

# Acquire the raw HTML; the scraper does not interpret it.
scraper = HTMLScraper()
content = scraper.fetch("https://example.com")


class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        return self._soup.title.string


# Interpret the acquired content.
parser = TitleParser(content)
title = parser.parse()
```
PDF example:
```python
from pathlib import Path

from omniread import FileSystemPDFClient, PDFParser, PDFScraper

client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))


class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # Implement PDF text extraction here.
        ...


parser = TextPDFParser(content)
result = parser.parse()
```
---
# Public API
The top-level `omniread` package re-exports the **recommended public entry
points** of OmniRead. Consumers are encouraged to import from this namespace
rather than from format-specific submodules, unless advanced customization
is required.
- `Content`: Canonical content model.
- `ContentType`: Supported media types.
- `HTMLScraper`: HTTP-based HTML acquisition.
- `HTMLParser`: Base parser for HTML DOM interpretation.
- `FileSystemPDFClient`: Local filesystem PDF access.
- `PDFScraper`: PDF-specific content acquisition.
- `PDFParser`: Base parser for PDF binary interpretation.
---
# Core Philosophy
`OmniRead` is designed as a **decoupled content engine**:
1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither
knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model,
ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input
is HTML, PDF, or JSON.
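One way to picture the normalized exchange is a parser that checks the content's declared type before interpreting it. The sketch below is self-contained and does not import `omniread`. The `supported_types` and `supports()` names echo the parser docstrings in this change set, but the classes, the MIME values, and the choice of `ValueError` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class SketchContentType(str, Enum):
    """Hypothetical stand-in for `ContentType`; the values are assumed."""

    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for `Content`."""

    raw: bytes
    content_type: SketchContentType


class SketchParser:
    """Validates content compatibility early, at construction time."""

    supported_types: frozenset = frozenset()

    def __init__(self, content: SketchContent) -> None:
        self.content = content
        if not self.supports():
            # Assumed error type; the real library may raise something else.
            raise ValueError(f"unsupported type: {content.content_type.value}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types


class HtmlLengthParser(SketchParser):
    """Toy parser that only accepts HTML content."""

    supported_types = frozenset({SketchContentType.HTML})

    def parse(self) -> int:
        return len(self.content.raw)


html = SketchContent(b"<p>hi</p>", SketchContentType.HTML)
pdf = SketchContent(b"%PDF-1.7", SketchContentType.PDF)

html_length = HtmlLengthParser(html).parse()

try:
    HtmlLengthParser(pdf)
    rejected = False
except ValueError:
    rejected = True
```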
---

@@ -1,4 +1,3 @@
# omniread # omniread
::: omniread ::: omniread
- [Omniread](omniread/)

@@ -2,7 +2,7 @@
"module": "omniread.core.content", "module": "omniread.core.content",
"content": { "content": {
"path": "omniread.core.content", "path": "omniread.core.content",
"docstring": "Canonical content models for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.", "docstring": "# Summary\n\nCanonical content models for OmniRead.\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.",
"objects": { "objects": {
"Enum": { "Enum": {
"name": "Enum", "name": "Enum",
@@ -43,8 +43,8 @@
"name": "ContentType", "name": "ContentType",
"kind": "class", "kind": "class",
"path": "omniread.core.content.ContentType", "path": "omniread.core.content.ContentType",
"signature": "<bound method Class.signature of Class('ContentType', 21, 42)>", "signature": "<bound method Class.signature of Class('ContentType', 19, 42)>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -80,8 +80,8 @@
"name": "Content", "name": "Content",
"kind": "class", "kind": "class",
"path": "omniread.core.content.Content", "path": "omniread.core.content.Content",
"signature": "<bound method Class.signature of Class('Content', 45, 75)>", "signature": "<bound method Class.signature of Class('Content', 45, 77)>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",

@@ -2,14 +2,14 @@
"module": "omniread.core", "module": "omniread.core",
"content": { "content": {
"path": "omniread.core", "path": "omniread.core",
"docstring": "Core domain contracts for OmniRead.\n\n---\n\n## Summary\n\nThis package defines the **format-agnostic domain layer** of OmniRead.\nIt exposes canonical content models and abstract interfaces that are\nimplemented by format-specific modules (HTML, PDF, etc.).\n\nPublic exports from this package are considered **stable contracts** and\nare safe for downstream consumers to depend on.\n\nSubmodules:\n- content: Canonical content models and enums\n- parser: Abstract parsing contracts\n- scraper: Abstract scraping contracts\n\nFormat-specific behavior must not be introduced at this layer.\n\n---\n\n## Public API\n\n Content\n ContentType\n\n---", "docstring": "# Summary\n\nCore domain contracts for OmniRead.\n\nThis package defines the **format-agnostic domain layer** of OmniRead.\nIt exposes canonical content models and abstract interfaces that are\nimplemented by format-specific modules (HTML, PDF, etc.).\n\nPublic exports from this package are considered **stable contracts** and\nare safe for downstream consumers to depend on.\n\nSubmodules:\n\n- `content`: Canonical content models and enums.\n- `parser`: Abstract parsing contracts.\n- `scraper`: Abstract scraping contracts.\n\nFormat-specific behavior must not be introduced at this layer.\n\n---\n\n# Public API\n\n- `Content`\n- `ContentType`\n\n---",
"objects": { "objects": {
"Content": { "Content": {
"name": "Content", "name": "Content",
"kind": "class", "kind": "class",
"path": "omniread.core.Content", "path": "omniread.core.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -46,7 +46,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.ContentType", "path": "omniread.core.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -83,7 +83,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.BaseParser", "path": "omniread.core.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>", "signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -104,7 +104,7 @@
"kind": "function", "kind": "function",
"path": "omniread.core.BaseParser.parse", "path": "omniread.core.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
@@ -120,14 +120,14 @@
"kind": "class", "kind": "class",
"path": "omniread.core.BaseScraper", "path": "omniread.core.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>", "signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.core.BaseScraper.fetch", "path": "omniread.core.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>", "signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
}, },
@@ -136,7 +136,7 @@
"kind": "module", "kind": "module",
"path": "omniread.core.content", "path": "omniread.core.content",
"signature": null, "signature": null,
"docstring": "Canonical content models for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.", "docstring": "# Summary\n\nCanonical content models for OmniRead.\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.",
"members": { "members": {
"Enum": { "Enum": {
"name": "Enum", "name": "Enum",
@@ -177,8 +177,8 @@
"name": "ContentType", "name": "ContentType",
"kind": "class", "kind": "class",
"path": "omniread.core.content.ContentType", "path": "omniread.core.content.ContentType",
"signature": "<bound method Class.signature of Class('ContentType', 21, 42)>", "signature": "<bound method Class.signature of Class('ContentType', 19, 42)>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -214,8 +214,8 @@
"name": "Content", "name": "Content",
"kind": "class", "kind": "class",
"path": "omniread.core.content.Content", "path": "omniread.core.content.Content",
"signature": "<bound method Class.signature of Class('Content', 45, 75)>", "signature": "<bound method Class.signature of Class('Content', 45, 77)>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -254,7 +254,7 @@
"kind": "module", "kind": "module",
"path": "omniread.core.parser", "path": "omniread.core.parser",
"signature": null, "signature": null,
"docstring": "Abstract parsing contracts for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources", "docstring": "# Summary\n\nAbstract parsing contracts for OmniRead.\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources",
"members": { "members": {
"ABC": { "ABC": {
"name": "ABC", "name": "ABC",
@@ -296,7 +296,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.parser.Content", "path": "omniread.core.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -333,7 +333,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.parser.ContentType", "path": "omniread.core.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -376,8 +376,8 @@
"name": "BaseParser", "name": "BaseParser",
"kind": "class", "kind": "class",
"path": "omniread.core.parser.BaseParser", "path": "omniread.core.parser.BaseParser",
"signature": "<bound method Class.signature of Class('BaseParser', 30, 108)>", "signature": "<bound method Class.signature of Class('BaseParser', 30, 111)>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -397,14 +397,14 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.core.parser.BaseParser.parse", "path": "omniread.core.parser.BaseParser.parse",
"signature": "<bound method Function.signature of Function('parse', 73, 91)>", "signature": "<bound method Function.signature of Function('parse', 75, 94)>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
"kind": "function", "kind": "function",
"path": "omniread.core.parser.BaseParser.supports", "path": "omniread.core.parser.BaseParser.supports",
"signature": "<bound method Function.signature of Function('supports', 93, 108)>", "signature": "<bound method Function.signature of Function('supports', 96, 111)>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise." "docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise."
} }
} }
@@ -416,7 +416,7 @@
"kind": "module", "kind": "module",
"path": "omniread.core.scraper", "path": "omniread.core.scraper",
"signature": null, "signature": null,
"docstring": "Abstract scraping contracts for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.", "docstring": "# Summary\n\nAbstract scraping contracts for OmniRead.\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.",
"members": { "members": {
"ABC": { "ABC": {
"name": "ABC", "name": "ABC",
@@ -458,7 +458,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.scraper.Content", "path": "omniread.core.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -494,15 +494,15 @@
"name": "BaseScraper", "name": "BaseScraper",
"kind": "class", "kind": "class",
"path": "omniread.core.scraper.BaseScraper", "path": "omniread.core.scraper.BaseScraper",
"signature": "<bound method Class.signature of Class('BaseScraper', 30, 76)>", "signature": "<bound method Class.signature of Class('BaseScraper', 30, 82)>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.core.scraper.BaseScraper.fetch", "path": "omniread.core.scraper.BaseScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 47, 76)>", "signature": "<bound method Function.signature of Function('fetch', 51, 82)>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
} }

@@ -2,7 +2,7 @@
"module": "omniread.core.parser", "module": "omniread.core.parser",
"content": { "content": {
"path": "omniread.core.parser", "path": "omniread.core.parser",
"docstring": "Abstract parsing contracts for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources", "docstring": "# Summary\n\nAbstract parsing contracts for OmniRead.\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources",
"objects": { "objects": {
"ABC": { "ABC": {
"name": "ABC", "name": "ABC",
@@ -44,7 +44,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.parser.Content", "path": "omniread.core.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -81,7 +81,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.parser.ContentType", "path": "omniread.core.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -124,8 +124,8 @@
"name": "BaseParser", "name": "BaseParser",
"kind": "class", "kind": "class",
"path": "omniread.core.parser.BaseParser", "path": "omniread.core.parser.BaseParser",
"signature": "<bound method Class.signature of Class('BaseParser', 30, 108)>", "signature": "<bound method Class.signature of Class('BaseParser', 30, 111)>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -145,14 +145,14 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.core.parser.BaseParser.parse", "path": "omniread.core.parser.BaseParser.parse",
"signature": "<bound method Function.signature of Function('parse', 73, 91)>", "signature": "<bound method Function.signature of Function('parse', 75, 94)>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
"kind": "function", "kind": "function",
"path": "omniread.core.parser.BaseParser.supports", "path": "omniread.core.parser.BaseParser.supports",
"signature": "<bound method Function.signature of Function('supports', 93, 108)>", "signature": "<bound method Function.signature of Function('supports', 96, 111)>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise." "docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise."
} }
} }


@@ -2,7 +2,7 @@
"module": "omniread.core.scraper", "module": "omniread.core.scraper",
"content": { "content": {
"path": "omniread.core.scraper", "path": "omniread.core.scraper",
"docstring": "Abstract scraping contracts for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.", "docstring": "# Summary\n\nAbstract scraping contracts for OmniRead.\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.",
"objects": { "objects": {
"ABC": { "ABC": {
"name": "ABC", "name": "ABC",
@@ -44,7 +44,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.scraper.Content", "path": "omniread.core.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -80,15 +80,15 @@
"name": "BaseScraper", "name": "BaseScraper",
"kind": "class", "kind": "class",
"path": "omniread.core.scraper.BaseScraper", "path": "omniread.core.scraper.BaseScraper",
"signature": "<bound method Class.signature of Class('BaseScraper', 30, 76)>", "signature": "<bound method Class.signature of Class('BaseScraper', 30, 82)>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.core.scraper.BaseScraper.fetch", "path": "omniread.core.scraper.BaseScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 47, 76)>", "signature": "<bound method Function.signature of Function('fetch', 51, 82)>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
} }


@@ -2,14 +2,14 @@
"module": "omniread.html", "module": "omniread.html",
"content": { "content": {
"path": "omniread.html", "path": "omniread.html",
"docstring": "HTML format implementation for OmniRead.\n\n---\n\n## Summary\n\nThis package provides **HTML-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nIt includes:\n- HTML parsers that interpret HTML content\n- HTML scrapers that retrieve HTML documents\n\nThis package:\n- Implements, but does not redefine, core contracts\n- May contain HTML-specific behavior and edge-case handling\n- Produces canonical content models defined in `omniread.core.content`\n\nConsumers should depend on `omniread.core` interfaces wherever possible and\nuse this package only when HTML-specific behavior is required.\n\n---\n\n## Public API\n\n HTMLScraper\n HTMLParser\n\n---", "docstring": "# Summary\n\nHTML format implementation for OmniRead.\n\nThis package provides **HTML-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nIt includes:\n\n- HTML parsers that interpret HTML content.\n- HTML scrapers that retrieve HTML documents.\n\nKey characteristics:\n\n- Implements, but does not redefine, core contracts.\n- May contain HTML-specific behavior and edge-case handling.\n- Produces canonical content models defined in `omniread.core.content`.\n\nConsumers should depend on `omniread.core` interfaces wherever possible and\nuse this package only when HTML-specific behavior is required.\n\n---\n\n# Public API\n\n- `HTMLScraper`\n- `HTMLParser`\n\n---",
"objects": { "objects": {
"HTMLScraper": { "HTMLScraper": {
"name": "HTMLScraper", "name": "HTMLScraper",
"kind": "class", "kind": "class",
"path": "omniread.html.HTMLScraper", "path": "omniread.html.HTMLScraper",
"signature": "<bound method Alias.signature of Alias('HTMLScraper', 'omniread.html.scraper.HTMLScraper')>", "signature": "<bound method Alias.signature of Alias('HTMLScraper', 'omniread.html.scraper.HTMLScraper')>",
"docstring": "Base HTML scraper using httpx.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n - Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n\n **Constraints:**\n \n - The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses", "docstring": "Base HTML scraper using `httpx`.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n - Fetches raw bytes and metadata only.\n - The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n\n **Constraints:**\n\n - The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.",
"members": { "members": {
"content_type": { "content_type": {
"name": "content_type", "name": "content_type",
@@ -39,7 +39,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.HTMLParser", "path": "omniread.html.HTMLParser",
"signature": "<bound method Alias.signature of Alias('HTMLParser', 'omniread.html.parser.HTMLParser')>", "signature": "<bound method Alias.signature of Alias('HTMLParser', 'omniread.html.parser.HTMLParser')>",
"docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n - Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n\n **Guarantees:**\n\n - Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n\n **Constraints:**\n \n - Concrete subclasses must define the output type `T` and implement the `parse()` method", "docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n - Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n\n **Guarantees:**\n\n - Accepts only HTML content.\n - Owns a parsed BeautifulSoup DOM tree.\n - Provides pure helper utilities for common HTML structures.\n\n **Constraints:**\n\n - Concrete subclasses must define the output type `T` and implement\n the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -53,7 +53,7 @@
"kind": "function", "kind": "function",
"path": "omniread.html.HTMLParser.parse", "path": "omniread.html.HTMLParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.html.parser.HTMLParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.html.parser.HTMLParser.parse')>",
"docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a deterministic, structured output" "docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output."
}, },
"parse_div": { "parse_div": {
"name": "parse_div", "name": "parse_div",
@@ -81,7 +81,7 @@
"kind": "function", "kind": "function",
"path": "omniread.html.HTMLParser.parse_meta", "path": "omniread.html.HTMLParser.parse_meta",
"signature": "<bound method Alias.signature of Alias('parse_meta', 'omniread.html.parser.HTMLParser.parse_meta')>", "signature": "<bound method Alias.signature of Alias('parse_meta', 'omniread.html.parser.HTMLParser.parse_meta')>",
"docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document\n - This includes: Document title, `<meta>` tag name/property content mappings" "docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document.\n - This includes: Document title, `<meta>` tag name/property to\n content mappings."
} }
} }
}, },
@@ -90,7 +90,7 @@
"kind": "module", "kind": "module",
"path": "omniread.html.parser", "path": "omniread.html.parser",
"signature": null, "signature": null,
"docstring": "HTML parser base implementations for OmniRead.\n\n---\n\n## Summary\n\nThis module provides reusable HTML parsing utilities built on top of\nthe abstract parser contracts defined in `omniread.core.parser`.\n\nIt supplies:\n- Content-type enforcement for HTML inputs\n- BeautifulSoup initialization and lifecycle management\n- Common helper methods for extracting structured data from HTML elements\n\nConcrete parsers must subclass `HTMLParser` and implement the `parse()` method\nto return a structured representation appropriate for their use case.", "docstring": "# Summary\n\nHTML parser base implementations for OmniRead.\n\nThis module provides reusable HTML parsing utilities built on top of\nthe abstract parser contracts defined in `omniread.core.parser`.\n\nIt supplies:\n\n- Content-type enforcement for HTML inputs\n- BeautifulSoup initialization and lifecycle management\n- Common helper methods for extracting structured data from HTML elements\n\nConcrete parsers must subclass `HTMLParser` and implement the `parse()` method\nto return a structured representation appropriate for their use case.",
"members": { "members": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -146,7 +146,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.ContentType", "path": "omniread.html.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -183,7 +183,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.Content", "path": "omniread.html.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -220,7 +220,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.BaseParser", "path": "omniread.html.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>", "signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -241,7 +241,7 @@
"kind": "function", "kind": "function",
"path": "omniread.html.parser.BaseParser.parse", "path": "omniread.html.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
@@ -263,8 +263,8 @@
"name": "HTMLParser", "name": "HTMLParser",
"kind": "class", "kind": "class",
"path": "omniread.html.parser.HTMLParser", "path": "omniread.html.parser.HTMLParser",
"signature": "<bound method Class.signature of Class('HTMLParser', 31, 199)>", "signature": "<bound method Class.signature of Class('HTMLParser', 30, 205)>",
"docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n - Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n\n **Guarantees:**\n\n - Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n\n **Constraints:**\n \n - Concrete subclasses must define the output type `T` and implement the `parse()` method", "docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n - Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n\n **Guarantees:**\n\n - Accepts only HTML content.\n - Owns a parsed BeautifulSoup DOM tree.\n - Provides pure helper utilities for common HTML structures.\n\n **Constraints:**\n\n - Concrete subclasses must define the output type `T` and implement\n the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -277,36 +277,36 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse", "path": "omniread.html.parser.HTMLParser.parse",
"signature": "<bound method Function.signature of Function('parse', 77, 91)>", "signature": "<bound method Function.signature of Function('parse', 81, 96)>",
"docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a deterministic, structured output" "docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output."
}, },
"parse_div": { "parse_div": {
"name": "parse_div", "name": "parse_div",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_div", "path": "omniread.html.parser.HTMLParser.parse_div",
"signature": "<bound method Function.signature of Function('parse_div', 97, 112)>", "signature": "<bound method Function.signature of Function('parse_div', 102, 117)>",
"docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div (Tag):\n BeautifulSoup tag representing a `<div>`.\n separator (str, optional):\n String used to separate text nodes.\n\nReturns:\n str:\n Flattened, whitespace-normalized text content." "docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div (Tag):\n BeautifulSoup tag representing a `<div>`.\n separator (str, optional):\n String used to separate text nodes.\n\nReturns:\n str:\n Flattened, whitespace-normalized text content."
}, },
"parse_link": { "parse_link": {
"name": "parse_link", "name": "parse_link",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_link", "path": "omniread.html.parser.HTMLParser.parse_link",
"signature": "<bound method Function.signature of Function('parse_link', 114, 127)>", "signature": "<bound method Function.signature of Function('parse_link', 119, 132)>",
"docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a (Tag):\n BeautifulSoup tag representing an anchor.\n\nReturns:\n Optional[str]:\n The value of the `href` attribute, or None if absent." "docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a (Tag):\n BeautifulSoup tag representing an anchor.\n\nReturns:\n Optional[str]:\n The value of the `href` attribute, or None if absent."
}, },
"parse_table": { "parse_table": {
"name": "parse_table", "name": "parse_table",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_table", "path": "omniread.html.parser.HTMLParser.parse_table",
"signature": "<bound method Function.signature of Function('parse_table', 129, 150)>", "signature": "<bound method Function.signature of Function('parse_table', 134, 155)>",
"docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table (Tag):\n BeautifulSoup tag representing a `<table>`.\n\nReturns:\n list[list[str]]:\n A list of rows, where each row is a list of cell text values." "docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table (Tag):\n BeautifulSoup tag representing a `<table>`.\n\nReturns:\n list[list[str]]:\n A list of rows, where each row is a list of cell text values."
}, },
"parse_meta": { "parse_meta": {
"name": "parse_meta", "name": "parse_meta",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_meta", "path": "omniread.html.parser.HTMLParser.parse_meta",
"signature": "<bound method Function.signature of Function('parse_meta', 172, 199)>", "signature": "<bound method Function.signature of Function('parse_meta', 177, 205)>",
"docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document\n - This includes: Document title, `<meta>` tag name/property content mappings" "docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document.\n - This includes: Document title, `<meta>` tag name/property to\n content mappings."
} }
} }
}, },
@@ -331,7 +331,7 @@
"kind": "module", "kind": "module",
"path": "omniread.html.scraper", "path": "omniread.html.scraper",
"signature": null, "signature": null,
"docstring": "HTML scraping implementation for OmniRead.\n\n---\n\n## Summary\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting", "docstring": "# Summary\n\nHTML scraping implementation for OmniRead.\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting",
"members": { "members": {
"httpx": { "httpx": {
"name": "httpx", "name": "httpx",
@@ -366,7 +366,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.Content", "path": "omniread.html.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -403,7 +403,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.ContentType", "path": "omniread.html.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -440,14 +440,14 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.BaseScraper", "path": "omniread.html.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>", "signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.BaseScraper.fetch", "path": "omniread.html.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>", "signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
}, },
@@ -455,8 +455,8 @@
"name": "HTMLScraper", "name": "HTMLScraper",
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.HTMLScraper", "path": "omniread.html.scraper.HTMLScraper",
"signature": "<bound method Class.signature of Class('HTMLScraper', 30, 139)>", "signature": "<bound method Class.signature of Class('HTMLScraper', 30, 143)>",
"docstring": "Base HTML scraper using httpx.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n - Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n\n **Constraints:**\n \n - The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses", "docstring": "Base HTML scraper using `httpx`.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n - Fetches raw bytes and metadata only.\n - The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n\n **Constraints:**\n\n - The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.",
"members": { "members": {
"content_type": { "content_type": {
"name": "content_type", "name": "content_type",
@@ -469,14 +469,14 @@
"name": "validate_content_type", "name": "validate_content_type",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.HTMLScraper.validate_content_type", "path": "omniread.html.scraper.HTMLScraper.validate_content_type",
"signature": "<bound method Function.signature of Function('validate_content_type', 74, 98)>", "signature": "<bound method Function.signature of Function('validate_content_type', 78, 102)>",
"docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response (httpx.Response):\n HTTP response returned by `httpx`.\n\nRaises:\n ValueError:\n If the `Content-Type` header is missing or does not indicate HTML content." "docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response (httpx.Response):\n HTTP response returned by `httpx`.\n\nRaises:\n ValueError:\n If the `Content-Type` header is missing or does not indicate HTML content."
}, },
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.HTMLScraper.fetch", "path": "omniread.html.scraper.HTMLScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 100, 139)>", "signature": "<bound method Function.signature of Function('fetch', 104, 143)>",
"docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source (str):\n URL of the HTML document.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to be merged into the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.\n\nRaises:\n httpx.HTTPError:\n If the HTTP request fails.\n ValueError:\n If the response is not valid HTML." "docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source (str):\n URL of the HTML document.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to be merged into the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.\n\nRaises:\n httpx.HTTPError:\n If the HTTP request fails.\n ValueError:\n If the response is not valid HTML."
} }
} }
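
Note on `validate_content_type`: the docstring above states that a `ValueError` is raised when the `Content-Type` header is missing or does not indicate HTML. The diff does not show the implementation, so the following is only a minimal sketch of that contract. It works on a plain header mapping rather than an `httpx.Response`, and the function name `check_html_content_type` and the accepted media types are assumptions made for illustration.

```python
from typing import Mapping


def check_html_content_type(headers: Mapping[str, str]) -> None:
    """Raise ValueError unless the headers declare an HTML media type.

    Hypothetical helper: the name and the accepted media types are
    assumptions, not OmniRead's actual implementation.
    """
    # HTTP header names are case-insensitive, so normalize keys before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw_value = normalized.get("content-type")
    if not raw_value:
        raise ValueError("Missing Content-Type header")

    # Drop parameters such as "; charset=utf-8" and compare the media type only.
    media_type = raw_value.split(";", 1)[0].strip().lower()
    if media_type not in {"text/html", "application/xhtml+xml"}:
        raise ValueError(f"Expected HTML content, got {raw_value!r}")
```

Comparing only the media type, not the full header value, keeps the check independent of charset parameters.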

@@ -2,7 +2,7 @@
"module": "omniread.html.parser", "module": "omniread.html.parser",
"content": { "content": {
"path": "omniread.html.parser", "path": "omniread.html.parser",
"docstring": "HTML parser base implementations for OmniRead.\n\n---\n\n## Summary\n\nThis module provides reusable HTML parsing utilities built on top of\nthe abstract parser contracts defined in `omniread.core.parser`.\n\nIt supplies:\n- Content-type enforcement for HTML inputs\n- BeautifulSoup initialization and lifecycle management\n- Common helper methods for extracting structured data from HTML elements\n\nConcrete parsers must subclass `HTMLParser` and implement the `parse()` method\nto return a structured representation appropriate for their use case.", "docstring": "# Summary\n\nHTML parser base implementations for OmniRead.\n\nThis module provides reusable HTML parsing utilities built on top of\nthe abstract parser contracts defined in `omniread.core.parser`.\n\nIt supplies:\n\n- Content-type enforcement for HTML inputs\n- BeautifulSoup initialization and lifecycle management\n- Common helper methods for extracting structured data from HTML elements\n\nConcrete parsers must subclass `HTMLParser` and implement the `parse()` method\nto return a structured representation appropriate for their use case.",
"objects": { "objects": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -58,7 +58,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.ContentType", "path": "omniread.html.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -95,7 +95,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.Content", "path": "omniread.html.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -132,7 +132,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.BaseParser", "path": "omniread.html.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>", "signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -153,7 +153,7 @@
"kind": "function", "kind": "function",
"path": "omniread.html.parser.BaseParser.parse", "path": "omniread.html.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
@@ -175,8 +175,8 @@
"name": "HTMLParser", "name": "HTMLParser",
"kind": "class", "kind": "class",
"path": "omniread.html.parser.HTMLParser", "path": "omniread.html.parser.HTMLParser",
"signature": "<bound method Class.signature of Class('HTMLParser', 31, 199)>", "signature": "<bound method Class.signature of Class('HTMLParser', 30, 205)>",
"docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n - Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n\n **Guarantees:**\n\n - Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n\n **Constraints:**\n \n - Concrete subclasses must define the output type `T` and implement the `parse()` method", "docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n - Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n\n **Guarantees:**\n\n - Accepts only HTML content.\n - Owns a parsed BeautifulSoup DOM tree.\n - Provides pure helper utilities for common HTML structures.\n\n **Constraints:**\n\n - Concrete subclasses must define the output type `T` and implement\n the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -189,36 +189,36 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse", "path": "omniread.html.parser.HTMLParser.parse",
"signature": "<bound method Function.signature of Function('parse', 77, 91)>", "signature": "<bound method Function.signature of Function('parse', 81, 96)>",
"docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a deterministic, structured output" "docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output."
}, },
"parse_div": { "parse_div": {
"name": "parse_div", "name": "parse_div",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_div", "path": "omniread.html.parser.HTMLParser.parse_div",
"signature": "<bound method Function.signature of Function('parse_div', 97, 112)>", "signature": "<bound method Function.signature of Function('parse_div', 102, 117)>",
"docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div (Tag):\n BeautifulSoup tag representing a `<div>`.\n separator (str, optional):\n String used to separate text nodes.\n\nReturns:\n str:\n Flattened, whitespace-normalized text content." "docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div (Tag):\n BeautifulSoup tag representing a `<div>`.\n separator (str, optional):\n String used to separate text nodes.\n\nReturns:\n str:\n Flattened, whitespace-normalized text content."
}, },
"parse_link": { "parse_link": {
"name": "parse_link", "name": "parse_link",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_link", "path": "omniread.html.parser.HTMLParser.parse_link",
"signature": "<bound method Function.signature of Function('parse_link', 114, 127)>", "signature": "<bound method Function.signature of Function('parse_link', 119, 132)>",
"docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a (Tag):\n BeautifulSoup tag representing an anchor.\n\nReturns:\n Optional[str]:\n The value of the `href` attribute, or None if absent." "docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a (Tag):\n BeautifulSoup tag representing an anchor.\n\nReturns:\n Optional[str]:\n The value of the `href` attribute, or None if absent."
}, },
"parse_table": { "parse_table": {
"name": "parse_table", "name": "parse_table",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_table", "path": "omniread.html.parser.HTMLParser.parse_table",
"signature": "<bound method Function.signature of Function('parse_table', 129, 150)>", "signature": "<bound method Function.signature of Function('parse_table', 134, 155)>",
"docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table (Tag):\n BeautifulSoup tag representing a `<table>`.\n\nReturns:\n list[list[str]]:\n A list of rows, where each row is a list of cell text values." "docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table (Tag):\n BeautifulSoup tag representing a `<table>`.\n\nReturns:\n list[list[str]]:\n A list of rows, where each row is a list of cell text values."
}, },
"parse_meta": { "parse_meta": {
"name": "parse_meta", "name": "parse_meta",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_meta", "path": "omniread.html.parser.HTMLParser.parse_meta",
"signature": "<bound method Function.signature of Function('parse_meta', 172, 199)>", "signature": "<bound method Function.signature of Function('parse_meta', 177, 205)>",
"docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document\n - This includes: Document title, `<meta>` tag name/property content mappings" "docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document.\n - This includes: Document title, `<meta>` tag name/property to\n content mappings."
} }
} }
}, },
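
Note on `parse_table`: the docstring above describes the return value as a list of rows, each a list of cell text values. The sketch below illustrates that shape only. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs without third-party packages, and `TableRowCollector` and `table_to_rows` are hypothetical names that are not part of OmniRead.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import List, Optional


class TableRowCollector(StdlibHTMLParser):
    """Collect table cells into rows (illustrative, not OmniRead code)."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text chunks of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th") and self.rows:
            self._cell = []  # start buffering cell text

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so cell values are normalized.
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append(text)
            self._cell = None


def table_to_rows(html: str) -> List[List[str]]:
    """Return the table in `html` as a 2D list of cell strings."""
    collector = TableRowCollector()
    collector.feed(html)
    collector.close()
    return collector.rows
```

Nested tables are not handled in this sketch; all rows would be flattened into one list.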

@@ -2,7 +2,7 @@
"module": "omniread.html.scraper", "module": "omniread.html.scraper",
"content": { "content": {
"path": "omniread.html.scraper", "path": "omniread.html.scraper",
"docstring": "HTML scraping implementation for OmniRead.\n\n---\n\n## Summary\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting", "docstring": "# Summary\n\nHTML scraping implementation for OmniRead.\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting",
"objects": { "objects": {
"httpx": { "httpx": {
"name": "httpx", "name": "httpx",
@@ -37,7 +37,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.Content", "path": "omniread.html.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -74,7 +74,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.ContentType", "path": "omniread.html.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -111,14 +111,14 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.BaseScraper", "path": "omniread.html.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>", "signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.BaseScraper.fetch", "path": "omniread.html.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>", "signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
}, },
@@ -126,8 +126,8 @@
"name": "HTMLScraper", "name": "HTMLScraper",
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.HTMLScraper", "path": "omniread.html.scraper.HTMLScraper",
"signature": "<bound method Class.signature of Class('HTMLScraper', 30, 139)>", "signature": "<bound method Class.signature of Class('HTMLScraper', 30, 143)>",
"docstring": "Base HTML scraper using httpx.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n - Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n\n **Constraints:**\n \n - The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses", "docstring": "Base HTML scraper using `httpx`.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n - Fetches raw bytes and metadata only.\n - The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n\n **Constraints:**\n\n - The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.",
"members": { "members": {
"content_type": { "content_type": {
"name": "content_type", "name": "content_type",
@@ -140,14 +140,14 @@
"name": "validate_content_type", "name": "validate_content_type",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.HTMLScraper.validate_content_type", "path": "omniread.html.scraper.HTMLScraper.validate_content_type",
"signature": "<bound method Function.signature of Function('validate_content_type', 74, 98)>", "signature": "<bound method Function.signature of Function('validate_content_type', 78, 102)>",
"docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response (httpx.Response):\n HTTP response returned by `httpx`.\n\nRaises:\n ValueError:\n If the `Content-Type` header is missing or does not indicate HTML content." "docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response (httpx.Response):\n HTTP response returned by `httpx`.\n\nRaises:\n ValueError:\n If the `Content-Type` header is missing or does not indicate HTML content."
}, },
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.HTMLScraper.fetch", "path": "omniread.html.scraper.HTMLScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 100, 139)>", "signature": "<bound method Function.signature of Function('fetch', 104, 143)>",
"docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source (str):\n URL of the HTML document.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to be merged into the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.\n\nRaises:\n httpx.HTTPError:\n If the HTTP request fails.\n ValueError:\n If the response is not valid HTML." "docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source (str):\n URL of the HTML document.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to be merged into the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.\n\nRaises:\n httpx.HTTPError:\n If the HTTP request fails.\n ValueError:\n If the response is not valid HTML."
} }
} }
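
Note on the scraper/parser boundary: the docstrings above say scrapers return raw bytes wrapped in a `Content` object, and that parsers validate content compatibility early. A rough, self-contained sketch of that hand-off follows. All names here (`MediaType`, `ContentSketch`, `InMemoryScraper`, `LengthParser`) and every field other than `raw` are assumptions for illustration, not the library's actual definitions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Mapping, Optional


class MediaType(str, Enum):
    # Stand-in for the library's content-type enum (assumed values).
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class ContentSketch:
    # Stand-in for the exchange object passed from scraper to parser.
    raw: bytes
    source: str
    content_type: MediaType
    metadata: Mapping[str, Any] = field(default_factory=dict)


class InMemoryScraper:
    """Acquires bytes only; it never inspects what they mean."""

    def __init__(self, pages: Mapping[str, bytes]) -> None:
        self._pages = pages

    def fetch(
        self, source: str, metadata: Optional[Mapping[str, Any]] = None
    ) -> ContentSketch:
        return ContentSketch(
            raw=self._pages[source],
            source=source,
            content_type=MediaType.HTML,
            metadata=dict(metadata or {}),
        )


class LengthParser:
    """Interprets content; rejects unsupported types at construction."""

    supported_types = {MediaType.HTML}

    def __init__(self, content: ContentSketch) -> None:
        if content.content_type not in self.supported_types:
            raise ValueError(f"Unsupported type: {content.content_type.value}")
        self._content = content

    def parse(self) -> int:
        # Deterministic for a given input: the byte length of the payload.
        return len(self._content.raw)
```

Because the parser only sees the exchange object, the in-memory scraper can be replaced by an HTTP or filesystem one without touching parser code.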

@@ -2,14 +2,14 @@
"module": "omniread", "module": "omniread",
"content": { "content": {
"path": "omniread", "path": "omniread",
"docstring": "OmniRead — format-agnostic content acquisition and parsing framework.\n\n---\n\n## Summary\n\nOmniRead provides a **cleanly layered architecture** for fetching, parsing,\nand normalizing content from heterogeneous sources such as HTML documents\nand PDF files.\n\nThe library is structured around three core concepts:\n\n1. **Content**: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.\n2. **Scrapers**: Components responsible for *acquiring* raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.\n3. **Parsers**: Components responsible for *interpreting* acquired content and converting it into structured, typed representations.\n\nOmniRead deliberately separates these responsibilities to ensure:\n- Clear boundaries between IO and interpretation\n- Replaceable implementations per format\n- Predictable, testable behavior\n\n---\n\n## Installation\n\nInstall OmniRead using pip:\n\n pip install omniread\n\nOr with Poetry:\n\n poetry add omniread\n\n---\n\n## Quick start\n\nHTML example:\n\n from omniread import HTMLScraper, HTMLParser\n\n scraper = HTMLScraper()\n content = scraper.fetch(\"https://example.com\")\n\n class TitleParser(HTMLParser[str]):\n def parse(self) -> str:\n return self._soup.title.string\n\n parser = TitleParser(content)\n title = parser.parse()\n\nPDF example:\n\n from omniread import FileSystemPDFClient, PDFScraper, PDFParser\n from pathlib import Path\n\n client = FileSystemPDFClient()\n scraper = PDFScraper(client=client)\n content = scraper.fetch(Path(\"document.pdf\"))\n\n class TextPDFParser(PDFParser[str]):\n def parse(self) -> str:\n # implement PDF text extraction\n ...\n\n parser = TextPDFParser(content)\n result = parser.parse()\n\n---\n\n## Public API\n\nThis module re-exports the **recommended public entry points** of OmniRead.\nConsumers are encouraged to import from this namespace rather than from\nformat-specific 
submodules directly, unless advanced customization is\nrequired.\n\n**Core:**\n- Content\n- ContentType\n\n**HTML:**\n- HTMLScraper\n- HTMLParser\n\n**PDF:**\n- FileSystemPDFClient\n- PDFScraper\n- PDFParser\n\n**Core Philosophy:**\n`OmniRead` is designed as a **decoupled content engine**:\n1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither knows about the other.\n2. **Normalized Exchange**: All components communicate via the `Content` model, ensuring a consistent contract.\n3. **Format Agnosticism**: The core logic is independent of whether the input is HTML, PDF, or JSON.\n\n---", "docstring": "# Summary\n\n`OmniRead` — format-agnostic content acquisition and parsing framework.\n\n`OmniRead` provides a **cleanly layered architecture** for fetching, parsing,\nand normalizing content from heterogeneous sources such as HTML documents\nand PDF files.\n\nThe library is structured around three core concepts:\n\n1. **`Content`**: A canonical, format-agnostic container representing raw content\n bytes and minimal contextual metadata.\n2. **`Scrapers`**: Components responsible for *acquiring* raw content from a\n source (HTTP, filesystem, object storage, etc.). `Scrapers` never interpret\n content.\n3. 
**`Parsers`**: Components responsible for *interpreting* acquired content and\n converting it into structured, typed representations.\n\n`OmniRead` deliberately separates these responsibilities to ensure:\n\n- Clear boundaries between IO and interpretation.\n- Replaceable implementations per format.\n- Predictable, testable behavior.\n\n# Installation\n\nInstall `OmniRead` using pip:\n\n```bash\npip install omniread\n```\n\nInstall OmniRead using Poetry:\n```bash\npoetry add omniread\n```\n\n---\n\n## Quick start\n\nExample:\n HTML example:\n ```python\n from omniread import HTMLScraper, HTMLParser\n\n scraper = HTMLScraper()\n content = scraper.fetch(\"https://example.com\")\n\n class TitleParser(HTMLParser[str]):\n def parse(self) -> str:\n return self._soup.title.string\n\n parser = TitleParser(content)\n title = parser.parse()\n ```\n\n PDF example:\n ```python\n from omniread import FileSystemPDFClient, PDFScraper, PDFParser\n from pathlib import Path\n\n client = FileSystemPDFClient()\n scraper = PDFScraper(client=client)\n content = scraper.fetch(Path(\"document.pdf\"))\n\n class TextPDFParser(PDFParser[str]):\n def parse(self) -> str:\n # implement PDF text extraction\n ...\n\n parser = TextPDFParser(content)\n result = parser.parse()\n ```\n\n---\n\n# Public API\n\nThis module re-exports the **recommended public entry points** of OmniRead.\nConsumers are encouraged to import from this namespace rather than from\nformat-specific submodules directly, unless advanced customization is\nrequired.\n\n- `Content`: Canonical content model.\n- `ContentType`: Supported media types.\n- `HTMLScraper`: HTTP-based HTML acquisition.\n- `HTMLParser`: Base parser for HTML DOM interpretation.\n- `FileSystemPDFClient`: Local filesystem PDF access.\n- `PDFScraper`: PDF-specific content acquisition.\n- `PDFParser`: Base parser for PDF binary interpretation.\n\n---\n\n# Core Philosophy\n\n`OmniRead` is designed as a **decoupled content engine**:\n\n1. 
**Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither\n knows about the other.\n2. **Normalized Exchange**: All components communicate via the `Content` model,\n ensuring a consistent contract.\n3. **Format Agnosticism**: The core logic is independent of whether the input\n is HTML, PDF, or JSON.\n\n---",
"objects": { "objects": {
"Content": { "Content": {
"name": "Content", "name": "Content",
"kind": "class", "kind": "class",
"path": "omniread.Content", "path": "omniread.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -46,7 +46,7 @@
"kind": "class", "kind": "class",
"path": "omniread.ContentType", "path": "omniread.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -83,7 +83,7 @@
"kind": "class", "kind": "class",
"path": "omniread.HTMLScraper", "path": "omniread.HTMLScraper",
"signature": "<bound method Alias.signature of Alias('HTMLScraper', 'omniread.html.HTMLScraper')>", "signature": "<bound method Alias.signature of Alias('HTMLScraper', 'omniread.html.HTMLScraper')>",
"docstring": "Base HTML scraper using httpx.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n - Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n\n **Constraints:**\n \n - The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses", "docstring": "Base HTML scraper using `httpx`.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n - Fetches raw bytes and metadata only.\n - The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n\n **Constraints:**\n\n - The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.",
"members": { "members": {
"content_type": { "content_type": {
"name": "content_type", "name": "content_type",
@@ -113,7 +113,7 @@
"kind": "class", "kind": "class",
"path": "omniread.HTMLParser", "path": "omniread.HTMLParser",
"signature": "<bound method Alias.signature of Alias('HTMLParser', 'omniread.html.HTMLParser')>", "signature": "<bound method Alias.signature of Alias('HTMLParser', 'omniread.html.HTMLParser')>",
"docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n - Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n\n **Guarantees:**\n\n - Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n\n **Constraints:**\n \n - Concrete subclasses must define the output type `T` and implement the `parse()` method", "docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n - Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n\n **Guarantees:**\n\n - Accepts only HTML content.\n - Owns a parsed BeautifulSoup DOM tree.\n - Provides pure helper utilities for common HTML structures.\n\n **Constraints:**\n\n - Concrete subclasses must define the output type `T` and implement\n the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -127,7 +127,7 @@
"kind": "function", "kind": "function",
"path": "omniread.HTMLParser.parse", "path": "omniread.HTMLParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.html.parser.HTMLParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.html.parser.HTMLParser.parse')>",
"docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a deterministic, structured output" "docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output."
}, },
"parse_div": { "parse_div": {
"name": "parse_div", "name": "parse_div",
@@ -155,7 +155,7 @@
"kind": "function", "kind": "function",
"path": "omniread.HTMLParser.parse_meta", "path": "omniread.HTMLParser.parse_meta",
"signature": "<bound method Alias.signature of Alias('parse_meta', 'omniread.html.parser.HTMLParser.parse_meta')>", "signature": "<bound method Alias.signature of Alias('parse_meta', 'omniread.html.parser.HTMLParser.parse_meta')>",
"docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document\n - This includes: Document title, `<meta>` tag name/property content mappings" "docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document.\n - This includes: Document title, `<meta>` tag name/property to\n content mappings."
} }
} }
}, },
@@ -164,7 +164,7 @@
"kind": "class", "kind": "class",
"path": "omniread.FileSystemPDFClient", "path": "omniread.FileSystemPDFClient",
"signature": "<bound method Alias.signature of Alias('FileSystemPDFClient', 'omniread.pdf.FileSystemPDFClient')>", "signature": "<bound method Alias.signature of Alias('FileSystemPDFClient', 'omniread.pdf.FileSystemPDFClient')>",
"docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns their raw binary contents", "docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns\n their raw binary contents.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -180,7 +180,7 @@
"kind": "class", "kind": "class",
"path": "omniread.PDFScraper", "path": "omniread.PDFScraper",
"signature": "<bound method Alias.signature of Alias('PDFScraper', 'omniread.pdf.PDFScraper')>", "signature": "<bound method Alias.signature of Alias('PDFScraper', 'omniread.pdf.PDFScraper')>",
"docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output into Content\n - Preserves caller-provided metadata\n\n **Constraints:**\n \n - The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend", "docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n - Preserves caller-provided metadata.\n\n **Constraints:**\n\n - The scraper does not perform parsing or interpretation.\n - Does not assume a specific storage backend.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -196,7 +196,7 @@
"kind": "class", "kind": "class",
"path": "omniread.PDFParser", "path": "omniread.PDFParser",
"signature": "<bound method Alias.signature of Alias('PDFParser', 'omniread.pdf.PDFParser')>", "signature": "<bound method Alias.signature of Alias('PDFParser', 'omniread.pdf.PDFParser')>",
"docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n\n **Constraints:**\n\n - Concrete implementations must: Define the output type `T`, implement the `parse()` method", "docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n\n **Constraints:**\n\n - Concrete implementations must define the output type `T` and\n implement the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -210,7 +210,7 @@
"kind": "function", "kind": "function",
"path": "omniread.PDFParser.parse", "path": "omniread.PDFParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.pdf.parser.PDFParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.pdf.parser.PDFParser.parse')>",
"docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and return a deterministic, structured output" "docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output."
} }
} }
}, },
@@ -219,14 +219,14 @@
"kind": "module", "kind": "module",
"path": "omniread.core", "path": "omniread.core",
"signature": null, "signature": null,
"docstring": "Core domain contracts for OmniRead.\n\n---\n\n## Summary\n\nThis package defines the **format-agnostic domain layer** of OmniRead.\nIt exposes canonical content models and abstract interfaces that are\nimplemented by format-specific modules (HTML, PDF, etc.).\n\nPublic exports from this package are considered **stable contracts** and\nare safe for downstream consumers to depend on.\n\nSubmodules:\n- content: Canonical content models and enums\n- parser: Abstract parsing contracts\n- scraper: Abstract scraping contracts\n\nFormat-specific behavior must not be introduced at this layer.\n\n---\n\n## Public API\n\n Content\n ContentType\n\n---", "docstring": "# Summary\n\nCore domain contracts for OmniRead.\n\nThis package defines the **format-agnostic domain layer** of OmniRead.\nIt exposes canonical content models and abstract interfaces that are\nimplemented by format-specific modules (HTML, PDF, etc.).\n\nPublic exports from this package are considered **stable contracts** and\nare safe for downstream consumers to depend on.\n\nSubmodules:\n\n- `content`: Canonical content models and enums.\n- `parser`: Abstract parsing contracts.\n- `scraper`: Abstract scraping contracts.\n\nFormat-specific behavior must not be introduced at this layer.\n\n---\n\n# Public API\n\n- `Content`\n- `ContentType`\n\n---",
"members": { "members": {
"Content": { "Content": {
"name": "Content", "name": "Content",
"kind": "class", "kind": "class",
"path": "omniread.core.Content", "path": "omniread.core.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -263,7 +263,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.ContentType", "path": "omniread.core.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -300,7 +300,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.BaseParser", "path": "omniread.core.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>", "signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -321,7 +321,7 @@
"kind": "function", "kind": "function",
"path": "omniread.core.BaseParser.parse", "path": "omniread.core.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
@@ -337,14 +337,14 @@
"kind": "class", "kind": "class",
"path": "omniread.core.BaseScraper", "path": "omniread.core.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>", "signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.core.BaseScraper.fetch", "path": "omniread.core.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>", "signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
}, },
@@ -353,7 +353,7 @@
"kind": "module", "kind": "module",
"path": "omniread.core.content", "path": "omniread.core.content",
"signature": null, "signature": null,
"docstring": "Canonical content models for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.", "docstring": "# Summary\n\nCanonical content models for OmniRead.\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.",
"members": { "members": {
"Enum": { "Enum": {
"name": "Enum", "name": "Enum",
@@ -394,8 +394,8 @@
"name": "ContentType", "name": "ContentType",
"kind": "class", "kind": "class",
"path": "omniread.core.content.ContentType", "path": "omniread.core.content.ContentType",
"signature": "<bound method Class.signature of Class('ContentType', 21, 42)>", "signature": "<bound method Class.signature of Class('ContentType', 19, 42)>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -431,8 +431,8 @@
"name": "Content", "name": "Content",
"kind": "class", "kind": "class",
"path": "omniread.core.content.Content", "path": "omniread.core.content.Content",
"signature": "<bound method Class.signature of Class('Content', 45, 75)>", "signature": "<bound method Class.signature of Class('Content', 45, 77)>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -471,7 +471,7 @@
"kind": "module", "kind": "module",
"path": "omniread.core.parser", "path": "omniread.core.parser",
"signature": null, "signature": null,
"docstring": "Abstract parsing contracts for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources", "docstring": "# Summary\n\nAbstract parsing contracts for OmniRead.\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources",
"members": { "members": {
"ABC": { "ABC": {
"name": "ABC", "name": "ABC",
@@ -513,7 +513,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.parser.Content", "path": "omniread.core.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -550,7 +550,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.parser.ContentType", "path": "omniread.core.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -593,8 +593,8 @@
"name": "BaseParser", "name": "BaseParser",
"kind": "class", "kind": "class",
"path": "omniread.core.parser.BaseParser", "path": "omniread.core.parser.BaseParser",
"signature": "<bound method Class.signature of Class('BaseParser', 30, 108)>", "signature": "<bound method Class.signature of Class('BaseParser', 30, 111)>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -614,14 +614,14 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.core.parser.BaseParser.parse", "path": "omniread.core.parser.BaseParser.parse",
"signature": "<bound method Function.signature of Function('parse', 73, 91)>", "signature": "<bound method Function.signature of Function('parse', 75, 94)>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
"kind": "function", "kind": "function",
"path": "omniread.core.parser.BaseParser.supports", "path": "omniread.core.parser.BaseParser.supports",
"signature": "<bound method Function.signature of Function('supports', 93, 108)>", "signature": "<bound method Function.signature of Function('supports', 96, 111)>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise." "docstring": "Check whether this parser supports the content's type.\n\nReturns:\n bool:\n True if the content type is supported; False otherwise."
} }
} }
@@ -633,7 +633,7 @@
"kind": "module", "kind": "module",
"path": "omniread.core.scraper", "path": "omniread.core.scraper",
"signature": null, "signature": null,
"docstring": "Abstract scraping contracts for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.", "docstring": "# Summary\n\nAbstract scraping contracts for OmniRead.\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.",
"members": { "members": {
"ABC": { "ABC": {
"name": "ABC", "name": "ABC",
@@ -675,7 +675,7 @@
"kind": "class", "kind": "class",
"path": "omniread.core.scraper.Content", "path": "omniread.core.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -711,15 +711,15 @@
"name": "BaseScraper", "name": "BaseScraper",
"kind": "class", "kind": "class",
"path": "omniread.core.scraper.BaseScraper", "path": "omniread.core.scraper.BaseScraper",
"signature": "<bound method Class.signature of Class('BaseScraper', 30, 76)>", "signature": "<bound method Class.signature of Class('BaseScraper', 30, 82)>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.core.scraper.BaseScraper.fetch", "path": "omniread.core.scraper.BaseScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 47, 76)>", "signature": "<bound method Function.signature of Function('fetch', 51, 82)>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
} }
@@ -732,14 +732,14 @@
"kind": "module", "kind": "module",
"path": "omniread.html", "path": "omniread.html",
"signature": null, "signature": null,
"docstring": "HTML format implementation for OmniRead.\n\n---\n\n## Summary\n\nThis package provides **HTML-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nIt includes:\n- HTML parsers that interpret HTML content\n- HTML scrapers that retrieve HTML documents\n\nThis package:\n- Implements, but does not redefine, core contracts\n- May contain HTML-specific behavior and edge-case handling\n- Produces canonical content models defined in `omniread.core.content`\n\nConsumers should depend on `omniread.core` interfaces wherever possible and\nuse this package only when HTML-specific behavior is required.\n\n---\n\n## Public API\n\n HTMLScraper\n HTMLParser\n\n---", "docstring": "# Summary\n\nHTML format implementation for OmniRead.\n\nThis package provides **HTML-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nIt includes:\n\n- HTML parsers that interpret HTML content.\n- HTML scrapers that retrieve HTML documents.\n\nKey characteristics:\n\n- Implements, but does not redefine, core contracts.\n- May contain HTML-specific behavior and edge-case handling.\n- Produces canonical content models defined in `omniread.core.content`.\n\nConsumers should depend on `omniread.core` interfaces wherever possible and\nuse this package only when HTML-specific behavior is required.\n\n---\n\n# Public API\n\n- `HTMLScraper`\n- `HTMLParser`\n\n---",
"members": { "members": {
"HTMLScraper": { "HTMLScraper": {
"name": "HTMLScraper", "name": "HTMLScraper",
"kind": "class", "kind": "class",
"path": "omniread.html.HTMLScraper", "path": "omniread.html.HTMLScraper",
"signature": "<bound method Alias.signature of Alias('HTMLScraper', 'omniread.html.scraper.HTMLScraper')>", "signature": "<bound method Alias.signature of Alias('HTMLScraper', 'omniread.html.scraper.HTMLScraper')>",
"docstring": "Base HTML scraper using httpx.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n - Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n\n **Constraints:**\n \n - The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses", "docstring": "Base HTML scraper using `httpx`.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n - Fetches raw bytes and metadata only.\n - The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n\n **Constraints:**\n\n - The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.",
"members": { "members": {
"content_type": { "content_type": {
"name": "content_type", "name": "content_type",
@@ -769,7 +769,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.HTMLParser", "path": "omniread.html.HTMLParser",
"signature": "<bound method Alias.signature of Alias('HTMLParser', 'omniread.html.parser.HTMLParser')>", "signature": "<bound method Alias.signature of Alias('HTMLParser', 'omniread.html.parser.HTMLParser')>",
"docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n - Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n\n **Guarantees:**\n\n - Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n\n **Constraints:**\n \n - Concrete subclasses must define the output type `T` and implement the `parse()` method", "docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n - Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n\n **Guarantees:**\n\n - Accepts only HTML content.\n - Owns a parsed BeautifulSoup DOM tree.\n - Provides pure helper utilities for common HTML structures.\n\n **Constraints:**\n\n - Concrete subclasses must define the output type `T` and implement\n the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -783,7 +783,7 @@
"kind": "function", "kind": "function",
"path": "omniread.html.HTMLParser.parse", "path": "omniread.html.HTMLParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.html.parser.HTMLParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.html.parser.HTMLParser.parse')>",
"docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a deterministic, structured output" "docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output."
}, },
"parse_div": { "parse_div": {
"name": "parse_div", "name": "parse_div",
@@ -811,7 +811,7 @@
"kind": "function", "kind": "function",
"path": "omniread.html.HTMLParser.parse_meta", "path": "omniread.html.HTMLParser.parse_meta",
"signature": "<bound method Alias.signature of Alias('parse_meta', 'omniread.html.parser.HTMLParser.parse_meta')>", "signature": "<bound method Alias.signature of Alias('parse_meta', 'omniread.html.parser.HTMLParser.parse_meta')>",
"docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document\n - This includes: Document title, `<meta>` tag name/property content mappings" "docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document.\n - This includes: Document title, `<meta>` tag name/property to\n content mappings."
} }
} }
}, },
@@ -820,7 +820,7 @@
"kind": "module", "kind": "module",
"path": "omniread.html.parser", "path": "omniread.html.parser",
"signature": null, "signature": null,
"docstring": "HTML parser base implementations for OmniRead.\n\n---\n\n## Summary\n\nThis module provides reusable HTML parsing utilities built on top of\nthe abstract parser contracts defined in `omniread.core.parser`.\n\nIt supplies:\n- Content-type enforcement for HTML inputs\n- BeautifulSoup initialization and lifecycle management\n- Common helper methods for extracting structured data from HTML elements\n\nConcrete parsers must subclass `HTMLParser` and implement the `parse()` method\nto return a structured representation appropriate for their use case.", "docstring": "# Summary\n\nHTML parser base implementations for OmniRead.\n\nThis module provides reusable HTML parsing utilities built on top of\nthe abstract parser contracts defined in `omniread.core.parser`.\n\nIt supplies:\n\n- Content-type enforcement for HTML inputs\n- BeautifulSoup initialization and lifecycle management\n- Common helper methods for extracting structured data from HTML elements\n\nConcrete parsers must subclass `HTMLParser` and implement the `parse()` method\nto return a structured representation appropriate for their use case.",
"members": { "members": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -876,7 +876,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.ContentType", "path": "omniread.html.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -913,7 +913,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.Content", "path": "omniread.html.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -950,7 +950,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.parser.BaseParser", "path": "omniread.html.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>", "signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -971,7 +971,7 @@
"kind": "function", "kind": "function",
"path": "omniread.html.parser.BaseParser.parse", "path": "omniread.html.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
@@ -993,8 +993,8 @@
"name": "HTMLParser", "name": "HTMLParser",
"kind": "class", "kind": "class",
"path": "omniread.html.parser.HTMLParser", "path": "omniread.html.parser.HTMLParser",
"signature": "<bound method Class.signature of Class('HTMLParser', 31, 199)>", "signature": "<bound method Class.signature of Class('HTMLParser', 30, 205)>",
"docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers\n - Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type\n\n **Guarantees:**\n\n - Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures\n\n **Constraints:**\n \n - Concrete subclasses must define the output type `T` and implement the `parse()` method", "docstring": "Base HTML parser.\n\nNotes:\n **Responsibilities:**\n\n - This class extends the core `BaseParser` with HTML-specific behavior,\n including DOM parsing via BeautifulSoup and reusable extraction helpers.\n - Provides reusable helpers for HTML extraction. Concrete parsers must\n explicitly define the return type.\n\n **Guarantees:**\n\n - Accepts only HTML content.\n - Owns a parsed BeautifulSoup DOM tree.\n - Provides pure helper utilities for common HTML structures.\n\n **Constraints:**\n\n - Concrete subclasses must define the output type `T` and implement\n the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -1007,36 +1007,36 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse", "path": "omniread.html.parser.HTMLParser.parse",
"signature": "<bound method Function.signature of Function('parse', 77, 91)>", "signature": "<bound method Function.signature of Function('parse', 81, 96)>",
"docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a deterministic, structured output" "docstring": "Fully parse the HTML content into structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the HTML DOM and return a\n deterministic, structured output."
}, },
"parse_div": { "parse_div": {
"name": "parse_div", "name": "parse_div",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_div", "path": "omniread.html.parser.HTMLParser.parse_div",
"signature": "<bound method Function.signature of Function('parse_div', 97, 112)>", "signature": "<bound method Function.signature of Function('parse_div', 102, 117)>",
"docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div (Tag):\n BeautifulSoup tag representing a `<div>`.\n separator (str, optional):\n String used to separate text nodes.\n\nReturns:\n str:\n Flattened, whitespace-normalized text content." "docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div (Tag):\n BeautifulSoup tag representing a `<div>`.\n separator (str, optional):\n String used to separate text nodes.\n\nReturns:\n str:\n Flattened, whitespace-normalized text content."
}, },
"parse_link": { "parse_link": {
"name": "parse_link", "name": "parse_link",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_link", "path": "omniread.html.parser.HTMLParser.parse_link",
"signature": "<bound method Function.signature of Function('parse_link', 114, 127)>", "signature": "<bound method Function.signature of Function('parse_link', 119, 132)>",
"docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a (Tag):\n BeautifulSoup tag representing an anchor.\n\nReturns:\n Optional[str]:\n The value of the `href` attribute, or None if absent." "docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a (Tag):\n BeautifulSoup tag representing an anchor.\n\nReturns:\n Optional[str]:\n The value of the `href` attribute, or None if absent."
}, },
"parse_table": { "parse_table": {
"name": "parse_table", "name": "parse_table",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_table", "path": "omniread.html.parser.HTMLParser.parse_table",
"signature": "<bound method Function.signature of Function('parse_table', 129, 150)>", "signature": "<bound method Function.signature of Function('parse_table', 134, 155)>",
"docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table (Tag):\n BeautifulSoup tag representing a `<table>`.\n\nReturns:\n list[list[str]]:\n A list of rows, where each row is a list of cell text values." "docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table (Tag):\n BeautifulSoup tag representing a `<table>`.\n\nReturns:\n list[list[str]]:\n A list of rows, where each row is a list of cell text values."
}, },
"parse_meta": { "parse_meta": {
"name": "parse_meta", "name": "parse_meta",
"kind": "function", "kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_meta", "path": "omniread.html.parser.HTMLParser.parse_meta",
"signature": "<bound method Function.signature of Function('parse_meta', 172, 199)>", "signature": "<bound method Function.signature of Function('parse_meta', 177, 205)>",
"docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document\n - This includes: Document title, `<meta>` tag name/property content mappings" "docstring": "Extract high-level metadata from the HTML document.\n\nReturns:\n dict[str, Any]:\n Dictionary containing extracted metadata.\n\nNotes:\n **Responsibilities:**\n\n - Extract high-level metadata from the HTML document.\n - This includes: Document title, `<meta>` tag name/property to\n content mappings."
} }
} }
}, },
@@ -1061,7 +1061,7 @@
"kind": "module", "kind": "module",
"path": "omniread.html.scraper", "path": "omniread.html.scraper",
"signature": null, "signature": null,
"docstring": "HTML scraping implementation for OmniRead.\n\n---\n\n## Summary\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting", "docstring": "# Summary\n\nHTML scraping implementation for OmniRead.\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting",
"members": { "members": {
"httpx": { "httpx": {
"name": "httpx", "name": "httpx",
@@ -1096,7 +1096,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.Content", "path": "omniread.html.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -1133,7 +1133,7 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.ContentType", "path": "omniread.html.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -1170,14 +1170,14 @@
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.BaseScraper", "path": "omniread.html.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>", "signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.BaseScraper.fetch", "path": "omniread.html.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>", "signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
}, },
@@ -1185,8 +1185,8 @@
"name": "HTMLScraper", "name": "HTMLScraper",
"kind": "class", "kind": "class",
"path": "omniread.html.scraper.HTMLScraper", "path": "omniread.html.scraper.HTMLScraper",
"signature": "<bound method Class.signature of Class('HTMLScraper', 30, 139)>", "signature": "<bound method Class.signature of Class('HTMLScraper', 30, 143)>",
"docstring": "Base HTML scraper using httpx.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object\n - Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata\n\n **Constraints:**\n \n - The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses", "docstring": "Base HTML scraper using `httpx`.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n - Fetches raw bytes and metadata only.\n - The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n\n **Constraints:**\n\n - The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.",
"members": { "members": {
"content_type": { "content_type": {
"name": "content_type", "name": "content_type",
@@ -1199,14 +1199,14 @@
"name": "validate_content_type", "name": "validate_content_type",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.HTMLScraper.validate_content_type", "path": "omniread.html.scraper.HTMLScraper.validate_content_type",
"signature": "<bound method Function.signature of Function('validate_content_type', 74, 98)>", "signature": "<bound method Function.signature of Function('validate_content_type', 78, 102)>",
"docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response (httpx.Response):\n HTTP response returned by `httpx`.\n\nRaises:\n ValueError:\n If the `Content-Type` header is missing or does not indicate HTML content." "docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response (httpx.Response):\n HTTP response returned by `httpx`.\n\nRaises:\n ValueError:\n If the `Content-Type` header is missing or does not indicate HTML content."
}, },
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.html.scraper.HTMLScraper.fetch", "path": "omniread.html.scraper.HTMLScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 100, 139)>", "signature": "<bound method Function.signature of Function('fetch', 104, 143)>",
"docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source (str):\n URL of the HTML document.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to be merged into the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.\n\nRaises:\n httpx.HTTPError:\n If the HTTP request fails.\n ValueError:\n If the response is not valid HTML." "docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source (str):\n URL of the HTML document.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to be merged into the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.\n\nRaises:\n httpx.HTTPError:\n If the HTTP request fails.\n ValueError:\n If the response is not valid HTML."
} }
} }
@@ -1220,14 +1220,14 @@
"kind": "module", "kind": "module",
"path": "omniread.pdf", "path": "omniread.pdf",
"signature": null, "signature": null,
"docstring": "PDF format implementation for OmniRead.\n\n---\n\n## Summary\n\nThis package provides **PDF-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nUnlike HTML, PDF handling requires an explicit client layer for document\naccess. This package therefore includes:\n- PDF clients for acquiring raw PDF data\n- PDF scrapers that coordinate client access\n- PDF parsers that extract structured content from PDF binaries\n\nPublic exports from this package represent the supported PDF pipeline\nand are safe for consumers to import directly when working with PDFs.\n\n---\n\n## Public API\n\n FileSystemPDFClient\n PDFScraper\n PDFParser\n\n---", "docstring": "# Summary\n\nPDF format implementation for OmniRead.\n\nThis package provides **PDF-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nUnlike HTML, PDF handling requires an explicit client layer for document\naccess. This package therefore includes:\n\n- PDF clients for acquiring raw PDF data.\n- PDF scrapers that coordinate client access.\n- PDF parsers that extract structured content from PDF binaries.\n\nPublic exports from this package represent the supported PDF pipeline\nand are safe for consumers to import directly when working with PDFs.\n\n---\n\n# Public API\n\n- `FileSystemPDFClient`\n- `PDFScraper`\n- `PDFParser`\n\n---",
"members": { "members": {
"FileSystemPDFClient": { "FileSystemPDFClient": {
"name": "FileSystemPDFClient", "name": "FileSystemPDFClient",
"kind": "class", "kind": "class",
"path": "omniread.pdf.FileSystemPDFClient", "path": "omniread.pdf.FileSystemPDFClient",
"signature": "<bound method Alias.signature of Alias('FileSystemPDFClient', 'omniread.pdf.client.FileSystemPDFClient')>", "signature": "<bound method Alias.signature of Alias('FileSystemPDFClient', 'omniread.pdf.client.FileSystemPDFClient')>",
"docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns their raw binary contents", "docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns\n their raw binary contents.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -1243,7 +1243,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.PDFScraper", "path": "omniread.pdf.PDFScraper",
"signature": "<bound method Alias.signature of Alias('PDFScraper', 'omniread.pdf.scraper.PDFScraper')>", "signature": "<bound method Alias.signature of Alias('PDFScraper', 'omniread.pdf.scraper.PDFScraper')>",
"docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output into Content\n - Preserves caller-provided metadata\n\n **Constraints:**\n \n - The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend", "docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n - Preserves caller-provided metadata.\n\n **Constraints:**\n\n - The scraper does not perform parsing or interpretation.\n - Does not assume a specific storage backend.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -1259,7 +1259,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.PDFParser", "path": "omniread.pdf.PDFParser",
"signature": "<bound method Alias.signature of Alias('PDFParser', 'omniread.pdf.parser.PDFParser')>", "signature": "<bound method Alias.signature of Alias('PDFParser', 'omniread.pdf.parser.PDFParser')>",
"docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n\n **Constraints:**\n\n - Concrete implementations must: Define the output type `T`, implement the `parse()` method", "docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n\n **Constraints:**\n\n - Concrete implementations must define the output type `T` and\n implement the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -1273,7 +1273,7 @@
"kind": "function", "kind": "function",
"path": "omniread.pdf.PDFParser.parse", "path": "omniread.pdf.PDFParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.pdf.parser.PDFParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.pdf.parser.PDFParser.parse')>",
"docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and return a deterministic, structured output" "docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output."
} }
} }
}, },
@@ -1282,7 +1282,7 @@
"kind": "module", "kind": "module",
"path": "omniread.pdf.client", "path": "omniread.pdf.client",
"signature": null, "signature": null,
"docstring": "PDF client abstractions for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems", "docstring": "# Summary\n\nPDF client abstractions for OmniRead.\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems",
"members": { "members": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -1316,14 +1316,14 @@
"name": "BasePDFClient", "name": "BasePDFClient",
"kind": "class", "kind": "class",
"path": "omniread.pdf.client.BasePDFClient", "path": "omniread.pdf.client.BasePDFClient",
"signature": "<bound method Class.signature of Class('BasePDFClient', 26, 54)>", "signature": "<bound method Class.signature of Class('BasePDFClient', 25, 57)>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure", "docstring": "Abstract client responsible for retrieving PDF bytes.\n\nRetrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to\n the backing store.\n - Return the full PDF binary payload.\n - Raise retrieval-specific errors on failure.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.client.BasePDFClient.fetch", "path": "omniread.pdf.client.BasePDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 37, 54)>", "signature": "<bound method Function.signature of Function('fetch', 40, 57)>",
"docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF location, such as a file path, object storage key, or remote reference.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n Exception:\n Retrieval-specific errors defined by the implementation." "docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF location, such as a file path, object storage key, or remote reference.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n Exception:\n Retrieval-specific errors defined by the implementation."
} }
} }
@@ -1332,14 +1332,14 @@
"name": "FileSystemPDFClient", "name": "FileSystemPDFClient",
"kind": "class", "kind": "class",
"path": "omniread.pdf.client.FileSystemPDFClient", "path": "omniread.pdf.client.FileSystemPDFClient",
"signature": "<bound method Class.signature of Class('FileSystemPDFClient', 57, 92)>", "signature": "<bound method Class.signature of Class('FileSystemPDFClient', 60, 96)>",
"docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns their raw binary contents", "docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns\n their raw binary contents.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.client.FileSystemPDFClient.fetch", "path": "omniread.pdf.client.FileSystemPDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 67, 92)>", "signature": "<bound method Function.signature of Function('fetch', 71, 96)>",
"docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path (Path):\n Filesystem path to the PDF file.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError:\n If the path does not exist.\n ValueError:\n If the path exists but is not a file." "docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path (Path):\n Filesystem path to the PDF file.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError:\n If the path does not exist.\n ValueError:\n If the path exists but is not a file."
} }
} }
@@ -1351,7 +1351,7 @@
"kind": "module", "kind": "module",
"path": "omniread.pdf.parser", "path": "omniread.pdf.parser",
"signature": null, "signature": null,
"docstring": "PDF parser base implementations for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.", "docstring": "# Summary\n\nPDF parser base implementations for OmniRead.\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.",
"members": { "members": {
"Generic": { "Generic": {
"name": "Generic", "name": "Generic",
@@ -1379,7 +1379,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.ContentType", "path": "omniread.pdf.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -1416,7 +1416,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.BaseParser", "path": "omniread.pdf.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>", "signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -1437,7 +1437,7 @@
"kind": "function", "kind": "function",
"path": "omniread.pdf.parser.BaseParser.parse", "path": "omniread.pdf.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
@@ -1459,8 +1459,8 @@
"name": "PDFParser", "name": "PDFParser",
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.PDFParser", "path": "omniread.pdf.parser.PDFParser",
"signature": "<bound method Class.signature of Class('PDFParser', 24, 61)>", "signature": "<bound method Class.signature of Class('PDFParser', 22, 62)>",
"docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n\n **Constraints:**\n\n - Concrete implementations must: Define the output type `T`, implement the `parse()` method", "docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n\n **Constraints:**\n\n - Concrete implementations must define the output type `T` and\n implement the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -1473,8 +1473,8 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.pdf.parser.PDFParser.parse", "path": "omniread.pdf.parser.PDFParser.parse",
"signature": "<bound method Function.signature of Function('parse', 43, 61)>", "signature": "<bound method Function.signature of Function('parse', 43, 62)>",
"docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and return a deterministic, structured output" "docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output."
} }
} }
} }
@@ -1485,7 +1485,7 @@
"kind": "module", "kind": "module",
"path": "omniread.pdf.scraper", "path": "omniread.pdf.scraper",
"signature": null, "signature": null,
"docstring": "PDF scraping implementation for OmniRead.\n\n---\n\n## Summary\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.", "docstring": "# Summary\n\nPDF scraping implementation for OmniRead.\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.",
"members": { "members": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -1513,7 +1513,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.Content", "path": "omniread.pdf.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -1550,7 +1550,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.ContentType", "path": "omniread.pdf.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -1587,14 +1587,14 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.BaseScraper", "path": "omniread.pdf.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>", "signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.scraper.BaseScraper.fetch", "path": "omniread.pdf.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>", "signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
}, },
@@ -1603,7 +1603,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.BasePDFClient", "path": "omniread.pdf.scraper.BasePDFClient",
"signature": "<bound method Alias.signature of Alias('BasePDFClient', 'omniread.pdf.client.BasePDFClient')>", "signature": "<bound method Alias.signature of Alias('BasePDFClient', 'omniread.pdf.client.BasePDFClient')>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure", "docstring": "Abstract client responsible for retrieving PDF bytes.\n\nRetrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to\n the backing store.\n - Return the full PDF binary payload.\n - Raise retrieval-specific errors on failure.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -1618,8 +1618,8 @@
"name": "PDFScraper", "name": "PDFScraper",
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.PDFScraper", "path": "omniread.pdf.scraper.PDFScraper",
"signature": "<bound method Class.signature of Class('PDFScraper', 22, 77)>", "signature": "<bound method Class.signature of Class('PDFScraper', 20, 77)>",
"docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output into Content\n - Preserves caller-provided metadata\n\n **Constraints:**\n \n - The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend", "docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n - Preserves caller-provided metadata.\n\n **Constraints:**\n\n - The scraper does not perform parsing or interpretation.\n - Does not assume a specific storage backend.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",

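The client docstrings in this diff describe a small contract: `fetch` takes a source identifier suited to the backing store, returns the full PDF payload as bytes, and raises a retrieval-specific error on failure. The filesystem variant is documented to raise `FileNotFoundError` for a missing path and `ValueError` for a path that is not a file. A minimal, self-contained sketch of that contract follows. It does not import `omniread`; the class names `SketchPDFClient`, `SketchFileSystemClient`, and `SketchInMemoryClient` are hypothetical stand-ins, and the in-memory store and its `LookupError` are assumptions made only for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict


class SketchPDFClient(ABC):
    """Hypothetical stand-in for the documented `BasePDFClient` contract."""

    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        """Return the full PDF payload, or raise a retrieval-specific error."""


class SketchFileSystemClient(SketchPDFClient):
    """Mirrors the error behavior documented for `FileSystemPDFClient.fetch`."""

    def fetch(self, path: Path) -> bytes:
        if not path.exists():
            raise FileNotFoundError(f"PDF not found: {path}")
        if not path.is_file():
            raise ValueError(f"Path is not a file: {path}")
        # No validation or interpretation: the bytes are returned untouched.
        return path.read_bytes()


class SketchInMemoryClient(SketchPDFClient):
    """Assumed extra backing store, showing that callers depend only on `fetch`."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self._blobs = dict(blobs)

    def fetch(self, key: str) -> bytes:
        try:
            return self._blobs[key]
        except KeyError:
            # Retrieval-specific error chosen for this sketch only.
            raise LookupError(f"No PDF stored under key: {key}") from None
```

Because both sketch clients expose the same `fetch` signature, a scraper written against the abstract base could accept either one, which appears to be the separation the `PDFScraper` docstring aims for when it says the scraper does not assume a specific storage backend.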

@@ -2,7 +2,7 @@
"module": "omniread.pdf.client", "module": "omniread.pdf.client",
"content": { "content": {
"path": "omniread.pdf.client", "path": "omniread.pdf.client",
"docstring": "PDF client abstractions for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems", "docstring": "# Summary\n\nPDF client abstractions for OmniRead.\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems",
"objects": { "objects": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -36,14 +36,14 @@
"name": "BasePDFClient", "name": "BasePDFClient",
"kind": "class", "kind": "class",
"path": "omniread.pdf.client.BasePDFClient", "path": "omniread.pdf.client.BasePDFClient",
"signature": "<bound method Class.signature of Class('BasePDFClient', 26, 54)>", "signature": "<bound method Class.signature of Class('BasePDFClient', 25, 57)>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure", "docstring": "Abstract client responsible for retrieving PDF bytes.\n\nRetrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to\n the backing store.\n - Return the full PDF binary payload.\n - Raise retrieval-specific errors on failure.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.client.BasePDFClient.fetch", "path": "omniread.pdf.client.BasePDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 37, 54)>", "signature": "<bound method Function.signature of Function('fetch', 40, 57)>",
"docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF location, such as a file path, object storage key, or remote reference.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n Exception:\n Retrieval-specific errors defined by the implementation." "docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF location, such as a file path, object storage key, or remote reference.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n Exception:\n Retrieval-specific errors defined by the implementation."
} }
} }
@@ -52,14 +52,14 @@
"name": "FileSystemPDFClient", "name": "FileSystemPDFClient",
"kind": "class", "kind": "class",
"path": "omniread.pdf.client.FileSystemPDFClient", "path": "omniread.pdf.client.FileSystemPDFClient",
"signature": "<bound method Class.signature of Class('FileSystemPDFClient', 57, 92)>", "signature": "<bound method Class.signature of Class('FileSystemPDFClient', 60, 96)>",
"docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns their raw binary contents", "docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns\n their raw binary contents.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.client.FileSystemPDFClient.fetch", "path": "omniread.pdf.client.FileSystemPDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 67, 92)>", "signature": "<bound method Function.signature of Function('fetch', 71, 96)>",
"docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path (Path):\n Filesystem path to the PDF file.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError:\n If the path does not exist.\n ValueError:\n If the path exists but is not a file." "docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path (Path):\n Filesystem path to the PDF file.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError:\n If the path does not exist.\n ValueError:\n If the path exists but is not a file."
} }
} }
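
The client contract documented in this file can be sketched as follows. This is a minimal illustration, not the library's code: `SketchBaseClient`, `SketchFileSystemClient`, and `demo` are hypothetical stand-ins defined here so the example runs without `omniread` installed. Only the behavior listed in the docstrings is assumed (return the full binary payload, raise `FileNotFoundError` for a missing path, raise `ValueError` for a path that is not a file).

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class SketchBaseClient(ABC):
    """Hypothetical stand-in for the abstract PDF client described above."""

    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        """Return the full PDF binary payload for `source`."""


class SketchFileSystemClient(SketchBaseClient):
    """Hypothetical stand-in that follows the documented Raises section."""

    def fetch(self, path: Path) -> bytes:
        path = Path(path)
        if not path.exists():
            # Documented: FileNotFoundError if the path does not exist.
            raise FileNotFoundError(f"PDF not found: {path}")
        if not path.is_file():
            # Documented: ValueError if the path exists but is not a file.
            raise ValueError(f"Path is not a file: {path}")
        # No validation or interpretation: the bytes are returned as-is.
        return path.read_bytes()


def demo() -> bytes:
    """Write a tiny placeholder payload to a temp dir and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = Path(tmp) / "document.pdf"
        pdf_path.write_bytes(b"%PDF-1.4\n%%EOF\n")
        return SketchFileSystemClient().fetch(pdf_path)
```

Keeping the existence and file-type checks inside the client means callers such as a scraper can treat every backing store the same way and only handle retrieval errors.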


@@ -2,14 +2,14 @@
"module": "omniread.pdf", "module": "omniread.pdf",
"content": { "content": {
"path": "omniread.pdf", "path": "omniread.pdf",
"docstring": "PDF format implementation for OmniRead.\n\n---\n\n## Summary\n\nThis package provides **PDF-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nUnlike HTML, PDF handling requires an explicit client layer for document\naccess. This package therefore includes:\n- PDF clients for acquiring raw PDF data\n- PDF scrapers that coordinate client access\n- PDF parsers that extract structured content from PDF binaries\n\nPublic exports from this package represent the supported PDF pipeline\nand are safe for consumers to import directly when working with PDFs.\n\n---\n\n## Public API\n\n FileSystemPDFClient\n PDFScraper\n PDFParser\n\n---", "docstring": "# Summary\n\nPDF format implementation for OmniRead.\n\nThis package provides **PDF-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nUnlike HTML, PDF handling requires an explicit client layer for document\naccess. This package therefore includes:\n\n- PDF clients for acquiring raw PDF data.\n- PDF scrapers that coordinate client access.\n- PDF parsers that extract structured content from PDF binaries.\n\nPublic exports from this package represent the supported PDF pipeline\nand are safe for consumers to import directly when working with PDFs.\n\n---\n\n# Public API\n\n- `FileSystemPDFClient`\n- `PDFScraper`\n- `PDFParser`\n\n---",
"objects": { "objects": {
"FileSystemPDFClient": { "FileSystemPDFClient": {
"name": "FileSystemPDFClient", "name": "FileSystemPDFClient",
"kind": "class", "kind": "class",
"path": "omniread.pdf.FileSystemPDFClient", "path": "omniread.pdf.FileSystemPDFClient",
"signature": "<bound method Alias.signature of Alias('FileSystemPDFClient', 'omniread.pdf.client.FileSystemPDFClient')>", "signature": "<bound method Alias.signature of Alias('FileSystemPDFClient', 'omniread.pdf.client.FileSystemPDFClient')>",
"docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns their raw binary contents", "docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns\n their raw binary contents.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -25,7 +25,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.PDFScraper", "path": "omniread.pdf.PDFScraper",
"signature": "<bound method Alias.signature of Alias('PDFScraper', 'omniread.pdf.scraper.PDFScraper')>", "signature": "<bound method Alias.signature of Alias('PDFScraper', 'omniread.pdf.scraper.PDFScraper')>",
"docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output into Content\n - Preserves caller-provided metadata\n\n **Constraints:**\n \n - The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend", "docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n - Preserves caller-provided metadata.\n\n **Constraints:**\n\n - The scraper does not perform parsing or interpretation.\n - Does not assume a specific storage backend.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -41,7 +41,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.PDFParser", "path": "omniread.pdf.PDFParser",
"signature": "<bound method Alias.signature of Alias('PDFParser', 'omniread.pdf.parser.PDFParser')>", "signature": "<bound method Alias.signature of Alias('PDFParser', 'omniread.pdf.parser.PDFParser')>",
"docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n\n **Constraints:**\n\n - Concrete implementations must: Define the output type `T`, implement the `parse()` method", "docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n\n **Constraints:**\n\n - Concrete implementations must define the output type `T` and\n implement the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -55,7 +55,7 @@
"kind": "function", "kind": "function",
"path": "omniread.pdf.PDFParser.parse", "path": "omniread.pdf.PDFParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.pdf.parser.PDFParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.pdf.parser.PDFParser.parse')>",
"docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and return a deterministic, structured output" "docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output."
} }
} }
}, },
@@ -64,7 +64,7 @@
"kind": "module", "kind": "module",
"path": "omniread.pdf.client", "path": "omniread.pdf.client",
"signature": null, "signature": null,
"docstring": "PDF client abstractions for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems", "docstring": "# Summary\n\nPDF client abstractions for OmniRead.\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems",
"members": { "members": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -98,14 +98,14 @@
"name": "BasePDFClient", "name": "BasePDFClient",
"kind": "class", "kind": "class",
"path": "omniread.pdf.client.BasePDFClient", "path": "omniread.pdf.client.BasePDFClient",
"signature": "<bound method Class.signature of Class('BasePDFClient', 26, 54)>", "signature": "<bound method Class.signature of Class('BasePDFClient', 25, 57)>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure", "docstring": "Abstract client responsible for retrieving PDF bytes.\n\nRetrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to\n the backing store.\n - Return the full PDF binary payload.\n - Raise retrieval-specific errors on failure.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.client.BasePDFClient.fetch", "path": "omniread.pdf.client.BasePDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 37, 54)>", "signature": "<bound method Function.signature of Function('fetch', 40, 57)>",
"docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF location, such as a file path, object storage key, or remote reference.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n Exception:\n Retrieval-specific errors defined by the implementation." "docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source (Any):\n Identifier of the PDF location, such as a file path, object storage key, or remote reference.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n Exception:\n Retrieval-specific errors defined by the implementation."
} }
} }
@@ -114,14 +114,14 @@
"name": "FileSystemPDFClient", "name": "FileSystemPDFClient",
"kind": "class", "kind": "class",
"path": "omniread.pdf.client.FileSystemPDFClient", "path": "omniread.pdf.client.FileSystemPDFClient",
"signature": "<bound method Class.signature of Class('FileSystemPDFClient', 57, 92)>", "signature": "<bound method Class.signature of Class('FileSystemPDFClient', 60, 96)>",
"docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns their raw binary contents", "docstring": "PDF client that reads from the local filesystem.\n\nNotes:\n **Guarantees:**\n\n - This client reads PDF files directly from the disk and returns\n their raw binary contents.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.client.FileSystemPDFClient.fetch", "path": "omniread.pdf.client.FileSystemPDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 67, 92)>", "signature": "<bound method Function.signature of Function('fetch', 71, 96)>",
"docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path (Path):\n Filesystem path to the PDF file.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError:\n If the path does not exist.\n ValueError:\n If the path exists but is not a file." "docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path (Path):\n Filesystem path to the PDF file.\n\nReturns:\n bytes:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError:\n If the path does not exist.\n ValueError:\n If the path exists but is not a file."
} }
} }
@@ -133,7 +133,7 @@
"kind": "module", "kind": "module",
"path": "omniread.pdf.parser", "path": "omniread.pdf.parser",
"signature": null, "signature": null,
"docstring": "PDF parser base implementations for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.", "docstring": "# Summary\n\nPDF parser base implementations for OmniRead.\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.",
"members": { "members": {
"Generic": { "Generic": {
"name": "Generic", "name": "Generic",
@@ -161,7 +161,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.ContentType", "path": "omniread.pdf.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -198,7 +198,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.BaseParser", "path": "omniread.pdf.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>", "signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -219,7 +219,7 @@
"kind": "function", "kind": "function",
"path": "omniread.pdf.parser.BaseParser.parse", "path": "omniread.pdf.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
@@ -241,8 +241,8 @@
"name": "PDFParser", "name": "PDFParser",
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.PDFParser", "path": "omniread.pdf.parser.PDFParser",
"signature": "<bound method Class.signature of Class('PDFParser', 24, 61)>", "signature": "<bound method Class.signature of Class('PDFParser', 22, 62)>",
"docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n\n **Constraints:**\n\n - Concrete implementations must: Define the output type `T`, implement the `parse()` method", "docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n\n **Constraints:**\n\n - Concrete implementations must define the output type `T` and\n implement the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -255,8 +255,8 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.pdf.parser.PDFParser.parse", "path": "omniread.pdf.parser.PDFParser.parse",
"signature": "<bound method Function.signature of Function('parse', 43, 61)>", "signature": "<bound method Function.signature of Function('parse', 43, 62)>",
"docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and return a deterministic, structured output" "docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output."
} }
} }
} }
@@ -267,7 +267,7 @@
"kind": "module", "kind": "module",
"path": "omniread.pdf.scraper", "path": "omniread.pdf.scraper",
"signature": null, "signature": null,
"docstring": "PDF scraping implementation for OmniRead.\n\n---\n\n## Summary\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.", "docstring": "# Summary\n\nPDF scraping implementation for OmniRead.\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.",
"members": { "members": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -295,7 +295,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.Content", "path": "omniread.pdf.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -332,7 +332,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.ContentType", "path": "omniread.pdf.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -369,14 +369,14 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.BaseScraper", "path": "omniread.pdf.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>", "signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.scraper.BaseScraper.fetch", "path": "omniread.pdf.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>", "signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
}, },
@@ -385,7 +385,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.BasePDFClient", "path": "omniread.pdf.scraper.BasePDFClient",
"signature": "<bound method Alias.signature of Alias('BasePDFClient', 'omniread.pdf.client.BasePDFClient')>", "signature": "<bound method Alias.signature of Alias('BasePDFClient', 'omniread.pdf.client.BasePDFClient')>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure", "docstring": "Abstract client responsible for retrieving PDF bytes.\n\nRetrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to\n the backing store.\n - Return the full PDF binary payload.\n - Raise retrieval-specific errors on failure.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -400,8 +400,8 @@
"name": "PDFScraper", "name": "PDFScraper",
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.PDFScraper", "path": "omniread.pdf.scraper.PDFScraper",
"signature": "<bound method Class.signature of Class('PDFScraper', 22, 77)>", "signature": "<bound method Class.signature of Class('PDFScraper', 20, 77)>",
"docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output into Content\n - Preserves caller-provided metadata\n\n **Constraints:**\n \n - The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend", "docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n - Preserves caller-provided metadata.\n\n **Constraints:**\n\n - The scraper does not perform parsing or interpretation.\n - Does not assume a specific storage backend.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
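
The scraper and parser responsibilities documented in this file can be sketched end to end. Everything below is a hypothetical stand-in so the example runs on its own: the `Sketch*` classes, `DictClient`, `HeaderVersionParser`, and `demo` are not part of `omniread`. The `Content` fields other than `raw`, the `PDF` enum member, and its MIME value are assumptions, since the diff only shows `raw` and `HTML`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Generic, Mapping, Optional, TypeVar

T = TypeVar("T")


class SketchContentType(str, Enum):
    HTML = "text/html"
    PDF = "application/pdf"  # assumed member and value


@dataclass(frozen=True)
class SketchContent:
    raw: bytes                                    # documented member
    source: str                                   # assumed field
    content_type: SketchContentType               # assumed field
    metadata: Optional[Mapping[str, Any]] = None  # assumed field


class DictClient:
    """Hypothetical in-memory client; any object with fetch() works here."""

    def __init__(self, blobs: Mapping[str, bytes]) -> None:
        self._blobs = blobs

    def fetch(self, source: str) -> bytes:
        return self._blobs[source]


class SketchPDFScraper:
    """Delegates retrieval to a client and wraps the bytes; never parses."""

    def __init__(self, client: Any) -> None:
        self._client = client

    def fetch(self, source: str, *, metadata=None) -> SketchContent:
        raw = self._client.fetch(source)
        return SketchContent(
            raw=raw,
            source=str(source),
            content_type=SketchContentType.PDF,
            metadata=dict(metadata or {}),  # caller metadata is preserved
        )


class SketchPDFParser(ABC, Generic[T]):
    """Checks content-type compatibility early; subclasses supply parse()."""

    supported_types = frozenset({SketchContentType.PDF})

    def __init__(self, content: SketchContent) -> None:
        if content.content_type not in self.supported_types:
            raise ValueError(f"Unsupported content type: {content.content_type}")
        self.content = content

    @abstractmethod
    def parse(self) -> T:
        """Return a deterministic, structured result of type T."""


class HeaderVersionParser(SketchPDFParser[str]):
    """Toy parser: reads the version from the leading '%PDF-x.y' header."""

    def parse(self) -> str:
        header = self.content.raw.split(b"\n", 1)[0].strip()
        if not header.startswith(b"%PDF-"):
            raise ValueError("Missing PDF header")
        return header[len(b"%PDF-"):].decode("ascii")


def demo():
    client = DictClient({"doc.pdf": b"%PDF-1.7\n%%EOF\n"})
    content = SketchPDFScraper(client).fetch("doc.pdf", metadata={"tag": "demo"})
    return content, HeaderVersionParser(content).parse()
```

The parser takes its content in the constructor, so an incompatible payload is rejected before `parse()` is ever called, which is the early-validation guarantee the `BaseParser` docstring describes.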


@@ -2,7 +2,7 @@
"module": "omniread.pdf.parser", "module": "omniread.pdf.parser",
"content": { "content": {
"path": "omniread.pdf.parser", "path": "omniread.pdf.parser",
"docstring": "PDF parser base implementations for OmniRead.\n\n---\n\n## Summary\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.", "docstring": "# Summary\n\nPDF parser base implementations for OmniRead.\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.",
"objects": { "objects": {
"Generic": { "Generic": {
"name": "Generic", "name": "Generic",
@@ -30,7 +30,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.ContentType", "path": "omniread.pdf.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -67,7 +67,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.BaseParser", "path": "omniread.pdf.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>", "signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the Content it is responsible for interpreting\n - Consumers may rely on early validation of content compatibility and type-stable return values from `parse()`\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`\n - Implementations must raise parsing-specific exceptions from `parse()`\n - Implementations must remain deterministic for a given input", "docstring": "Base interface for all parsers.\n\nNotes:\n **Guarantees:**\n\n - A parser is a self-contained object that owns the `Content` it is\n responsible for interpreting.\n - Consumers may rely on early validation of content compatibility\n and type-stable return values from `parse()`.\n\n **Responsibilities:**\n\n - Implementations must declare supported content types via `supported_types`.\n - Implementations must raise parsing-specific exceptions from `parse()`.\n - Implementations must remain deterministic for a given input.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -88,7 +88,7 @@
"kind": "function", "kind": "function",
"path": "omniread.pdf.parser.BaseParser.parse", "path": "omniread.pdf.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>", "signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and return a deterministic, structured output" "docstring": "Parse the owned content into structured output.\n\nReturns:\n T:\n Parsed, structured representation.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully consume the provided content and\n return a deterministic, structured output."
}, },
"supports": { "supports": {
"name": "supports", "name": "supports",
@@ -110,8 +110,8 @@
"name": "PDFParser", "name": "PDFParser",
"kind": "class", "kind": "class",
"path": "omniread.pdf.parser.PDFParser", "path": "omniread.pdf.parser.PDFParser",
"signature": "<bound method Class.signature of Class('PDFParser', 24, 61)>", "signature": "<bound method Class.signature of Class('PDFParser', 22, 62)>",
"docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies\n\n **Constraints:**\n\n - Concrete implementations must: Define the output type `T`, implement the `parse()` method", "docstring": "Base PDF parser.\n\nNotes:\n **Responsibilities:**\n\n - This class enforces PDF content-type compatibility and provides\n the extension point for implementing concrete PDF parsing strategies.\n\n **Constraints:**\n\n - Concrete implementations must define the output type `T` and\n implement the `parse()` method.",
"members": { "members": {
"supported_types": { "supported_types": {
"name": "supported_types", "name": "supported_types",
@@ -124,8 +124,8 @@
"name": "parse", "name": "parse",
"kind": "function", "kind": "function",
"path": "omniread.pdf.parser.PDFParser.parse", "path": "omniread.pdf.parser.PDFParser.parse",
"signature": "<bound method Function.signature of Function('parse', 43, 61)>", "signature": "<bound method Function.signature of Function('parse', 43, 62)>",
"docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and return a deterministic, structured output" "docstring": "Parse PDF content into a structured output.\n\nReturns:\n T:\n Parsed representation of type `T`.\n\nRaises:\n Exception:\n Parsing-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must fully interpret the PDF binary payload and\n return a deterministic, structured output."
} }
} }
} }

View File

@@ -2,7 +2,7 @@
"module": "omniread.pdf.scraper", "module": "omniread.pdf.scraper",
"content": { "content": {
"path": "omniread.pdf.scraper", "path": "omniread.pdf.scraper",
"docstring": "PDF scraping implementation for OmniRead.\n\n---\n\n## Summary\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.", "docstring": "# Summary\n\nPDF scraping implementation for OmniRead.\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.",
"objects": { "objects": {
"Any": { "Any": {
"name": "Any", "name": "Any",
@@ -30,7 +30,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.Content", "path": "omniread.pdf.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>", "signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type\n - This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.",
"members": { "members": {
"raw": { "raw": {
"name": "raw", "name": "raw",
@@ -67,7 +67,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.ContentType", "path": "omniread.pdf.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>", "signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the content source\n - It is primarily used for routing content to the appropriate parser or downstream consumer", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.",
"members": { "members": {
"HTML": { "HTML": {
"name": "HTML", "name": "HTML",
@@ -104,14 +104,14 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.BaseScraper", "path": "omniread.pdf.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>", "signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it\n - A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object\n - Scrapers define how content is obtained, not what the content means\n - Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
"kind": "function", "kind": "function",
"path": "omniread.pdf.scraper.BaseScraper.fetch", "path": "omniread.pdf.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>", "signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.)\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object" "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object."
} }
} }
}, },
@@ -120,7 +120,7 @@
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.BasePDFClient", "path": "omniread.pdf.scraper.BasePDFClient",
"signature": "<bound method Alias.signature of Alias('BasePDFClient', 'omniread.pdf.client.BasePDFClient')>", "signature": "<bound method Alias.signature of Alias('BasePDFClient', 'omniread.pdf.client.BasePDFClient')>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure", "docstring": "Abstract client responsible for retrieving PDF bytes.\n\nRetrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).\n\nNotes:\n **Responsibilities:**\n\n - Implementations must accept a source identifier appropriate to\n the backing store.\n - Return the full PDF binary payload.\n - Raise retrieval-specific errors on failure.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",
@@ -135,8 +135,8 @@
"name": "PDFScraper", "name": "PDFScraper",
"kind": "class", "kind": "class",
"path": "omniread.pdf.scraper.PDFScraper", "path": "omniread.pdf.scraper.PDFScraper",
"signature": "<bound method Class.signature of Class('PDFScraper', 22, 77)>", "signature": "<bound method Class.signature of Class('PDFScraper', 20, 77)>",
"docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output into Content\n - Preserves caller-provided metadata\n\n **Constraints:**\n \n - The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend", "docstring": "Scraper for PDF sources.\n\nNotes:\n **Responsibilities:**\n\n - Delegates byte retrieval to a PDF client and normalizes output\n into `Content`.\n - Preserves caller-provided metadata.\n\n **Constraints:**\n\n - The scraper does not perform parsing or interpretation.\n - Does not assume a specific storage backend.",
"members": { "members": {
"fetch": { "fetch": {
"name": "fetch", "name": "fetch",

View File

@@ -1,43 +1,48 @@
""" """
OmniRead — format-agnostic content acquisition and parsing framework. # Summary
--- `OmniRead` — format-agnostic content acquisition and parsing framework.
## Summary `OmniRead` provides a **cleanly layered architecture** for fetching, parsing,
OmniRead provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents and normalizing content from heterogeneous sources such as HTML documents
and PDF files. and PDF files.
The library is structured around three core concepts: The library is structured around three core concepts:
1. **Content**: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata. 1. **`Content`**: A canonical, format-agnostic container representing raw content
2. **Scrapers**: Components responsible for *acquiring* raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content. bytes and minimal contextual metadata.
3. **Parsers**: Components responsible for *interpreting* acquired content and converting it into structured, typed representations. 2. **`Scrapers`**: Components responsible for *acquiring* raw content from a
source (HTTP, filesystem, object storage, etc.). `Scrapers` never interpret
content.
3. **`Parsers`**: Components responsible for *interpreting* acquired content and
converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure: `OmniRead` deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior
--- - Clear boundaries between IO and interpretation.
- Replaceable implementations per format.
- Predictable, testable behavior.
## Installation # Installation
Install OmniRead using pip: Install `OmniRead` using pip:
```bash
pip install omniread pip install omniread
```
Or with Poetry: Install OmniRead using Poetry:
```bash
poetry add omniread poetry add omniread
```
--- ---
## Quick start ## Quick start
Example:
HTML example: HTML example:
```python
from omniread import HTMLScraper, HTMLParser from omniread import HTMLScraper, HTMLParser
scraper = HTMLScraper() scraper = HTMLScraper()
@@ -49,9 +54,10 @@ HTML example:
parser = TitleParser(content) parser = TitleParser(content)
title = parser.parse() title = parser.parse()
```
PDF example: PDF example:
```python
from omniread import FileSystemPDFClient, PDFScraper, PDFParser from omniread import FileSystemPDFClient, PDFScraper, PDFParser
from pathlib import Path from pathlib import Path
@@ -66,34 +72,37 @@ PDF example:
parser = TextPDFParser(content) parser = TextPDFParser(content)
result = parser.parse() result = parser.parse()
```
--- ---
## Public API # Public API
This module re-exports the **recommended public entry points** of OmniRead. This module re-exports the **recommended public entry points** of OmniRead.
Consumers are encouraged to import from this namespace rather than from Consumers are encouraged to import from this namespace rather than from
format-specific submodules directly, unless advanced customization is format-specific submodules directly, unless advanced customization is
required. required.
**Core:** - `Content`: Canonical content model.
- Content - `ContentType`: Supported media types.
- ContentType - `HTMLScraper`: HTTP-based HTML acquisition.
- `HTMLParser`: Base parser for HTML DOM interpretation.
- `FileSystemPDFClient`: Local filesystem PDF access.
- `PDFScraper`: PDF-specific content acquisition.
- `PDFParser`: Base parser for PDF binary interpretation.
**HTML:** ---
- HTMLScraper
- HTMLParser
**PDF:** # Core Philosophy
- FileSystemPDFClient
- PDFScraper
- PDFParser
**Core Philosophy:**
`OmniRead` is designed as a **decoupled content engine**: `OmniRead` is designed as a **decoupled content engine**:
1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model, ensuring a consistent contract. 1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither
3. **Format Agnosticism**: The core logic is independent of whether the input is HTML, PDF, or JSON. knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model,
ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input
is HTML, PDF, or JSON.
--- ---
""" """

View File

@@ -1,10 +1,8 @@
""" """
# Summary
Core domain contracts for OmniRead. Core domain contracts for OmniRead.
---
## Summary
This package defines the **format-agnostic domain layer** of OmniRead. This package defines the **format-agnostic domain layer** of OmniRead.
It exposes canonical content models and abstract interfaces that are It exposes canonical content models and abstract interfaces that are
implemented by format-specific modules (HTML, PDF, etc.). implemented by format-specific modules (HTML, PDF, etc.).
@@ -13,18 +11,19 @@ Public exports from this package are considered **stable contracts** and
are safe for downstream consumers to depend on. are safe for downstream consumers to depend on.
Submodules: Submodules:
- content: Canonical content models and enums
- parser: Abstract parsing contracts - `content`: Canonical content models and enums.
- scraper: Abstract scraping contracts - `parser`: Abstract parsing contracts.
- `scraper`: Abstract scraping contracts.
Format-specific behavior must not be introduced at this layer. Format-specific behavior must not be introduced at this layer.
--- ---
## Public API # Public API
Content - `Content`
ContentType - `ContentType`
--- ---
""" """

View File

@@ -1,10 +1,8 @@
""" """
# Summary
Canonical content models for OmniRead. Canonical content models for OmniRead.
---
## Summary
This module defines the **format-agnostic content representation** used across This module defines the **format-agnostic content representation** used across
all parsers and scrapers in OmniRead. all parsers and scrapers in OmniRead.
@@ -25,8 +23,10 @@ class ContentType(str, Enum):
Notes: Notes:
**Guarantees:** **Guarantees:**
- This enum represents the declared or inferred media type of the content source - This enum represents the declared or inferred media type of the
- It is primarily used for routing content to the appropriate parser or downstream consumer content source.
- It is primarily used for routing content to the appropriate
parser or downstream consumer.
""" """
HTML = "text/html" HTML = "text/html"
@@ -50,8 +50,10 @@ class Content:
Notes: Notes:
**Responsibilities:** **Responsibilities:**
- A `Content` instance represents a raw content payload along with minimal contextual metadata describing its origin and type - A `Content` instance represents a raw content payload along with
- This class is the primary exchange format between Scrapers, Parsers, and Downstream consumers minimal contextual metadata describing its origin and type.
- This class is the primary exchange format between scrapers,
parsers, and downstream consumers.
""" """
raw: bytes raw: bytes
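
The docstrings in this hunk describe `ContentType` as a routing key and `Content` as the exchange format between components. A minimal, self-contained sketch of that idea might look like the following. Only `ContentType(str, Enum)`, `HTML = "text/html"`, and `raw: bytes` appear in the diff; the `PDF` member, the `source`, `content_type`, and `metadata` fields, and the `route` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, Mapping, Optional


class ContentType(str, Enum):
    # The HTML value is shown in the diff; the PDF member is an assumption.
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class Content:
    # `raw` is shown in the diff; the remaining fields are hypothetical.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Mapping[str, Any] = field(default_factory=dict)


def route(content: Content, handlers: Dict[ContentType, Callable[[Content], Any]]) -> Any:
    """Dispatch content to a handler keyed by its declared media type."""
    if content.content_type is None or content.content_type not in handlers:
        raise ValueError(f"No handler for content type: {content.content_type!r}")
    return handlers[content.content_type](content)


# Hypothetical handlers standing in for real parsers.
handlers = {
    ContentType.HTML: lambda c: c.raw.decode("utf-8"),
    ContentType.PDF: lambda c: len(c.raw),
}

page = Content(raw=b"<title>Hi</title>", source="memory://page", content_type=ContentType.HTML)
routed = route(page, handlers)
```

Because the enum subclasses `str`, its members compare equal to plain MIME strings, which keeps routing tables easy to build from HTTP headers or configuration.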

View File

@@ -1,19 +1,19 @@
""" """
# Summary
Abstract parsing contracts for OmniRead. Abstract parsing contracts for OmniRead.
---
## Summary
This module defines the **format-agnostic parser interface** used to transform This module defines the **format-agnostic parser interface** used to transform
raw content into structured, typed representations. raw content into structured, typed representations.
Parsers are responsible for: Parsers are responsible for:
- Interpreting a single `Content` instance - Interpreting a single `Content` instance
- Validating compatibility with the content type - Validating compatibility with the content type
- Producing a structured output suitable for downstream consumers - Producing a structured output suitable for downstream consumers
Parsers are not responsible for: Parsers are not responsible for:
- Fetching or acquiring content - Fetching or acquiring content
- Performing retries or error recovery - Performing retries or error recovery
- Managing multiple content sources - Managing multiple content sources
@@ -34,14 +34,16 @@ class BaseParser(ABC, Generic[T]):
Notes: Notes:
**Guarantees:** **Guarantees:**
- A parser is a self-contained object that owns the Content it is responsible for interpreting - A parser is a self-contained object that owns the `Content` it is
- Consumers may rely on early validation of content compatibility and type-stable return values from `parse()` responsible for interpreting.
- Consumers may rely on early validation of content compatibility
and type-stable return values from `parse()`.
**Responsibilities:** **Responsibilities:**
- Implementations must declare supported content types via `supported_types` - Implementations must declare supported content types via `supported_types`.
- Implementations must raise parsing-specific exceptions from `parse()` - Implementations must raise parsing-specific exceptions from `parse()`.
- Implementations must remain deterministic for a given input - Implementations must remain deterministic for a given input.
""" """
supported_types: Set[ContentType] = set() supported_types: Set[ContentType] = set()
@@ -86,7 +88,8 @@ class BaseParser(ABC, Generic[T]):
Notes: Notes:
**Responsibilities:** **Responsibilities:**
- Implementations must fully consume the provided content and return a deterministic, structured output - Implementations must fully consume the provided content and
return a deterministic, structured output.
""" """
raise NotImplementedError raise NotImplementedError
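
The parser contract above (declared `supported_types`, early validation, a typed `parse()`) can be sketched without the library installed. This is an illustration under assumptions, not the actual `omniread.core.parser` source: the constructor, the `supports` classmethod body, the "empty set accepts anything" rule, and `WordCountParser` are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, Optional, Set, TypeVar

T = TypeVar("T")


class ContentType(str, Enum):
    HTML = "text/html"


@dataclass
class Content:
    # Stand-in for the real model; only `raw` is confirmed by the diff.
    raw: bytes
    content_type: Optional[ContentType] = None


class BaseParser(ABC, Generic[T]):
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        # Early validation: reject incompatible content at construction time.
        if not self.supports(content):
            raise ValueError(f"Unsupported content type: {content.content_type!r}")
        self.content = content

    @classmethod
    def supports(cls, content: Content) -> bool:
        # Assumption: an empty set means "accept anything".
        if not cls.supported_types:
            return True
        return content.content_type in cls.supported_types

    @abstractmethod
    def parse(self) -> T:
        raise NotImplementedError


class WordCountParser(BaseParser[int]):
    """Hypothetical parser: counts whitespace-separated tokens."""

    supported_types = {ContentType.HTML}

    def parse(self) -> int:
        return len(self.content.raw.split())


count = WordCountParser(Content(b"one two three", ContentType.HTML)).parse()
```

Validating in the constructor means a mismatched `Content` fails before `parse()` is ever called, which is one way to honor the "early validation" guarantee.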

View File

@@ -1,19 +1,19 @@
""" """
# Summary
Abstract scraping contracts for OmniRead. Abstract scraping contracts for OmniRead.
---
## Summary
This module defines the **format-agnostic scraper interface** responsible for This module defines the **format-agnostic scraper interface** responsible for
acquiring raw content from external sources. acquiring raw content from external sources.
Scrapers are responsible for: Scrapers are responsible for:
- Locating and retrieving raw content bytes - Locating and retrieving raw content bytes
- Attaching minimal contextual metadata - Attaching minimal contextual metadata
- Returning normalized `Content` objects - Returning normalized `Content` objects
Scrapers are explicitly NOT responsible for: Scrapers are explicitly NOT responsible for:
- Parsing or interpreting content - Parsing or interpreting content
- Inferring structure or semantics - Inferring structure or semantics
- Performing content-type specific processing - Performing content-type specific processing
@@ -34,14 +34,18 @@ class BaseScraper(ABC):
Notes: Notes:
**Responsibilities:** **Responsibilities:**
- A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it - A scraper is responsible ONLY for fetching raw content (bytes)
- A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a `Content` object from a source. It must not interpret or parse it.
- Scrapers define how content is obtained, not what the content means - A scraper is a stateless acquisition component that retrieves raw
- Implementations may vary in transport mechanism, authentication strategy, retry and backoff behavior content from a source and returns it as a `Content` object.
- Scrapers define how content is obtained, not what the content means.
- Implementations may vary in transport mechanism, authentication
strategy, retry and backoff behavior.
**Constraints:** **Constraints:**
- Implementations must not parse content, modify content semantics, or couple scraping logic to a specific parser - Implementations must not parse content, modify content semantics,
or couple scraping logic to a specific parser.
""" """
@abstractmethod @abstractmethod
@@ -56,9 +60,10 @@ class BaseScraper(ABC):
Args: Args:
source (str): source (str):
Location identifier (URL, file path, S3 URI, etc.) Location identifier (URL, file path, S3 URI, etc.).
metadata (Optional[Mapping[str, Any]], optional): metadata (Optional[Mapping[str, Any]], optional):
Optional hints for the scraper (headers, auth, etc.) Optional hints for the scraper (headers, auth, etc.).
Returns: Returns:
Content: Content:
@@ -71,6 +76,7 @@ class BaseScraper(ABC):
Notes: Notes:
**Responsibilities:** **Responsibilities:**
- Implementations must retrieve the content referenced by `source` and return it as raw bytes wrapped in a `Content` object - Implementations must retrieve the content referenced by `source`
and return it as raw bytes wrapped in a `Content` object.
""" """
raise NotImplementedError raise NotImplementedError
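
To make the scraper contract concrete, here is a small sketch of a scraper that fetches bytes and attaches metadata without interpreting anything. It assumes a `fetch(source, metadata=None)` signature based on the docstring's `Args` section; the `Content` stand-in, `InMemoryScraper`, and the choice of `LookupError` are hypothetical and not part of OmniRead.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional


@dataclass
class Content:
    # Stand-in for the real model; fields other than `raw` are assumptions.
    raw: bytes
    source: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class BaseScraper(ABC):
    @abstractmethod
    def fetch(self, source: str, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        raise NotImplementedError


class InMemoryScraper(BaseScraper):
    """Hypothetical scraper that serves bytes from a dict instead of a network."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        if source not in self._store:
            # Retrieval-specific error; the exception type is a choice made here.
            raise LookupError(f"Unknown source: {source}")
        # Bytes are returned untouched: no decoding, no parsing.
        return Content(raw=self._store[source], source=source, metadata=dict(metadata or {}))


scraper = InMemoryScraper({"memory://a": b"<p>hello</p>"})
content = scraper.fetch("memory://a", metadata={"requested_by": "example"})
```

A dict-backed scraper like this can be handy in tests, since any parser that accepts `Content` can be exercised without network access.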

View File

@@ -1,31 +1,31 @@
""" """
# Summary
HTML format implementation for OmniRead. HTML format implementation for OmniRead.
---
## Summary
This package provides **HTML-specific implementations** of the core OmniRead This package provides **HTML-specific implementations** of the core OmniRead
contracts defined in `omniread.core`. contracts defined in `omniread.core`.
It includes: It includes:
- HTML parsers that interpret HTML content
- HTML scrapers that retrieve HTML documents
This package: - HTML parsers that interpret HTML content.
- Implements, but does not redefine, core contracts - HTML scrapers that retrieve HTML documents.
- May contain HTML-specific behavior and edge-case handling
- Produces canonical content models defined in `omniread.core.content` Key characteristics:
- Implements, but does not redefine, core contracts.
- May contain HTML-specific behavior and edge-case handling.
- Produces canonical content models defined in `omniread.core.content`.
Consumers should depend on `omniread.core` interfaces wherever possible and Consumers should depend on `omniread.core` interfaces wherever possible and
use this package only when HTML-specific behavior is required. use this package only when HTML-specific behavior is required.
--- ---
## Public API # Public API
HTMLScraper - `HTMLScraper`
HTMLParser - `HTMLParser`
--- ---
""" """

View File

@@ -1,14 +1,13 @@
""" """
# Summary
HTML parser base implementations for OmniRead. HTML parser base implementations for OmniRead.
---
## Summary
This module provides reusable HTML parsing utilities built on top of This module provides reusable HTML parsing utilities built on top of
the abstract parser contracts defined in `omniread.core.parser`. the abstract parser contracts defined in `omniread.core.parser`.
It supplies: It supplies:
- Content-type enforcement for HTML inputs - Content-type enforcement for HTML inputs
- BeautifulSoup initialization and lifecycle management - BeautifulSoup initialization and lifecycle management
- Common helper methods for extracting structured data from HTML elements - Common helper methods for extracting structured data from HTML elements
@@ -35,16 +34,21 @@ class HTMLParser(BaseParser[T], Generic[T]):
Notes: Notes:
**Responsibilities:** **Responsibilities:**
- This class extends the core `BaseParser` with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers - This class extends the core `BaseParser` with HTML-specific behavior,
- Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type including DOM parsing via BeautifulSoup and reusable extraction helpers.
- Provides reusable helpers for HTML extraction. Concrete parsers must
explicitly define the return type.
**Guarantees:** **Guarantees:**
- Characteristics: Accepts only HTML content, owns a parsed BeautifulSoup DOM tree, provides pure helper utilities for common HTML structures - Accepts only HTML content.
- Owns a parsed BeautifulSoup DOM tree.
- Provides pure helper utilities for common HTML structures.
**Constraints:** **Constraints:**
- Concrete subclasses must define the output type `T` and implement the `parse()` method - Concrete subclasses must define the output type `T` and implement
the `parse()` method.
""" """
supported_types = {ContentType.HTML} supported_types = {ContentType.HTML}
@@ -86,7 +90,8 @@ class HTMLParser(BaseParser[T], Generic[T]):
Notes: Notes:
**Responsibilities:** **Responsibilities:**
- Implementations must fully interpret the HTML DOM and return a deterministic, structured output - Implementations must fully interpret the HTML DOM and return a
deterministic, structured output.
""" """
raise NotImplementedError raise NotImplementedError
@@ -180,8 +185,9 @@ class HTMLParser(BaseParser[T], Generic[T]):
Notes: Notes:
**Responsibilities:** **Responsibilities:**
- Extract high-level metadata from the HTML document - Extract high-level metadata from the HTML document.
- This includes: Document title, `<meta>` tag name/property → content mappings - This includes: Document title, `<meta>` tag name/property to
content mappings.
""" """
soup = self._soup soup = self._soup

View File

@@ -1,20 +1,20 @@
""" """
# Summary
HTML scraping implementation for OmniRead. HTML scraping implementation for OmniRead.
---
## Summary
This module provides an HTTP-based scraper for retrieving HTML documents. This module provides an HTTP-based scraper for retrieving HTML documents.
It implements the core `BaseScraper` contract using `httpx` as the transport It implements the core `BaseScraper` contract using `httpx` as the transport
layer. layer.
This scraper is responsible for: This scraper is responsible for:
- Fetching raw HTML bytes over HTTP(S) - Fetching raw HTML bytes over HTTP(S)
- Validating response content type - Validating response content type
- Attaching HTTP metadata to the returned content - Attaching HTTP metadata to the returned content
This scraper is not responsible for: This scraper is not responsible for:
- Parsing or interpreting HTML - Parsing or interpreting HTML
- Retrying failed requests - Retrying failed requests
- Managing crawl policies or rate limiting - Managing crawl policies or rate limiting
@@ -29,17 +29,21 @@ from omniread.core.scraper import BaseScraper
class HTMLScraper(BaseScraper): class HTMLScraper(BaseScraper):
""" """
Base HTML scraper using httpx. Base HTML scraper using `httpx`.
Notes: Notes:
**Responsibilities:** **Responsibilities:**
- This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a `Content` object - This scraper retrieves HTML documents over HTTP(S) and returns
- Fetches raw bytes and metadata only. The scraper uses `httpx.Client` for HTTP requests, enforces an HTML content type, preserves HTTP response metadata them as raw content wrapped in a `Content` object.
- Fetches raw bytes and metadata only.
- The scraper uses `httpx.Client` for HTTP requests, enforces an
HTML content type, and preserves HTTP response metadata.
**Constraints:** **Constraints:**
- The scraper does not: Parse HTML, perform retries or backoff, handle non-HTML responses - The scraper does not: Parse HTML, perform retries or backoff,
handle non-HTML responses.
""" """
def __init__( def __init__(

View File

@@ -1,29 +1,28 @@
 """
+# Summary
 PDF format implementation for OmniRead.
----
-## Summary
 This package provides **PDF-specific implementations** of the core OmniRead
 contracts defined in `omniread.core`.
 Unlike HTML, PDF handling requires an explicit client layer for document
 access. This package therefore includes:
-- PDF clients for acquiring raw PDF data
-- PDF scrapers that coordinate client access
-- PDF parsers that extract structured content from PDF binaries
+
+- PDF clients for acquiring raw PDF data.
+- PDF scrapers that coordinate client access.
+- PDF parsers that extract structured content from PDF binaries.
 Public exports from this package represent the supported PDF pipeline
 and are safe for consumers to import directly when working with PDFs.
 ---
-## Public API
-FileSystemPDFClient
-PDFScraper
-PDFParser
+# Public API
+- `FileSystemPDFClient`
+- `PDFScraper`
+- `PDFParser`
 ---
 """


@@ -1,10 +1,8 @@
 """
+# Summary
 PDF client abstractions for OmniRead.
----
-## Summary
 This module defines the **client layer** responsible for retrieving raw PDF
 bytes from a concrete backing store.
@@ -13,6 +11,7 @@ decoupled from scraping and parsing logic. They do not perform validation,
 interpretation, or content extraction.
 Typical backing stores include:
+
 - Local filesystems
 - Object storage (S3, GCS, etc.)
 - Network file systems
@@ -25,13 +24,17 @@ from pathlib import Path
 class BasePDFClient(ABC):
     """
-    Abstract client responsible for retrieving PDF bytes
-    from a specific backing store (filesystem, S3, FTP, etc.).
+    Abstract client responsible for retrieving PDF bytes.
+
+    Retrieves bytes from a specific backing store (filesystem, S3, FTP, etc.).
     Notes:
         **Responsibilities:**
-            - Implementations must accept a source identifier appropriate to the backing store, return the full PDF binary payload, and raise retrieval-specific errors on failure
+            - Implementations must accept a source identifier appropriate to
+              the backing store.
+            - Return the full PDF binary payload.
+            - Raise retrieval-specific errors on failure.
     """
     @abstractmethod
@@ -61,7 +64,8 @@ class FileSystemPDFClient(BasePDFClient):
     Notes:
         **Guarantees:**
-            - This client reads PDF files directly from the disk and returns their raw binary contents
+            - This client reads PDF files directly from the disk and returns
+              their raw binary contents.
     """
     def fetch(self, path: Path) -> bytes:
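
The client contract described in this file (accept a source identifier, return the full payload, raise a retrieval-specific error) can be illustrated with a custom backing store. The sketch below is self-contained, so `BasePDFClient` is a stand-in that assumes a single abstract `fetch` method, and `InMemoryPDFClient` is a hypothetical implementation that is not part of OmniRead.

```python
from abc import ABC, abstractmethod
from typing import Mapping


class BasePDFClient(ABC):
    """Stand-in for OmniRead's base class; only `fetch` is assumed here."""

    @abstractmethod
    def fetch(self, source):
        """Return the full PDF payload as bytes, or raise on failure."""


class InMemoryPDFClient(BasePDFClient):
    """Hypothetical client that serves PDFs from a mapping, e.g. in tests."""

    def __init__(self, documents: Mapping[str, bytes]):
        self._documents = documents

    def fetch(self, source: str) -> bytes:
        try:
            return self._documents[source]
        except KeyError:
            # Raise a retrieval-specific error, as the contract asks for.
            raise FileNotFoundError(f"no PDF stored under {source!r}") from None


client = InMemoryPDFClient({"report.pdf": b"%PDF-1.7 minimal"})
payload = client.fetch("report.pdf")
```

Because the client performs no validation or interpretation, swapping the filesystem for another store should only require a new `fetch` implementation.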


@@ -1,10 +1,8 @@
 """
+# Summary
 PDF parser base implementations for OmniRead.
----
-## Summary
 This module defines the **PDF-specific parser contract**, extending the
 format-agnostic `BaseParser` with constraints appropriate for PDF content.
@@ -28,11 +26,13 @@ class PDFParser(BaseParser[T], Generic[T]):
     Notes:
         **Responsibilities:**
-            - This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies
+            - This class enforces PDF content-type compatibility and provides
+              the extension point for implementing concrete PDF parsing strategies.
         **Constraints:**
-            - Concrete implementations must: Define the output type `T`, implement the `parse()` method
+            - Concrete implementations must define the output type `T` and
+              implement the `parse()` method.
     """
     supported_types = {ContentType.PDF}
@@ -56,6 +56,7 @@ class PDFParser(BaseParser[T], Generic[T]):
         Notes:
             **Responsibilities:**
-                - Implementations must fully interpret the PDF binary payload and return a deterministic, structured output
+                - Implementations must fully interpret the PDF binary payload and
+                  return a deterministic, structured output.
         """
         raise NotImplementedError
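
The parser contract above asks concrete implementations to fix the output type `T` and implement `parse()` deterministically. A minimal sketch of that shape follows. To stay self-contained it uses a stand-in `PDFParser` that takes raw bytes (the README constructs the real one from a `Content` object), and `PDFVersionParser` and the `raw` attribute are hypothetical names.

```python
import re
from typing import Generic, TypeVar

T = TypeVar("T")


class PDFParser(Generic[T]):
    """Stand-in for OmniRead's `PDFParser`; the `raw` attribute is assumed."""

    def __init__(self, raw: bytes):
        self.raw = raw

    def parse(self) -> T:
        raise NotImplementedError


class PDFVersionParser(PDFParser[str]):
    """Hypothetical parser that reads the version from the file header."""

    def parse(self) -> str:
        # A PDF file starts with a header such as b"%PDF-1.7".
        match = re.match(rb"%PDF-(\d\.\d)", self.raw)
        if match is None:
            raise ValueError("payload does not start with a PDF header")
        return match.group(1).decode("ascii")


version = PDFVersionParser(b"%PDF-1.7\n...").parse()
```

The same input always yields the same string, which is the kind of determinism the docstring asks for.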


@@ -1,10 +1,8 @@
 """
+# Summary
 PDF scraping implementation for OmniRead.
----
-## Summary
 This module provides a PDF-specific scraper that coordinates PDF byte
 retrieval via a client and normalizes the result into a `Content` object.
@@ -26,12 +24,14 @@ class PDFScraper(BaseScraper):
     Notes:
         **Responsibilities:**
-            - Delegates byte retrieval to a PDF client and normalizes output into Content
-            - Preserves caller-provided metadata
+            - Delegates byte retrieval to a PDF client and normalizes output
+              into `Content`.
+            - Preserves caller-provided metadata.
         **Constraints:**
-            - The scraper: Does not perform parsing or interpretation, does not assume a specific storage backend
+            - The scraper does not perform parsing or interpretation.
+            - Does not assume a specific storage backend.
     """
     def __init__(self, *, client: BasePDFClient):
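
The responsibilities listed for `PDFScraper` (delegate retrieval to a client, wrap the bytes in `Content`, keep caller metadata) can be sketched roughly as below. Everything here is a stand-in written for illustration: the `Content` field names, the `metadata` keyword, `StaticClient`, and `SketchPDFScraper` are assumptions, not OmniRead's real definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Content:
    """Stand-in for OmniRead's `Content`; field names are assumptions."""

    raw: bytes
    source: str
    content_type: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class StaticClient:
    """Hypothetical client that always returns the same bytes."""

    def fetch(self, source: str) -> bytes:
        return b"%PDF-1.7 demo"


class SketchPDFScraper:
    """Coordinates a client and normalizes its output; never parses."""

    def __init__(self, *, client):
        self._client = client

    def fetch(
        self, source: str, *, metadata: Optional[Dict[str, Any]] = None
    ) -> Content:
        raw = self._client.fetch(source)  # byte retrieval is delegated
        return Content(
            raw=raw,
            source=source,
            content_type="application/pdf",
            metadata=dict(metadata or {}),  # caller metadata is preserved
        )


scraper = SketchPDFScraper(client=StaticClient())
content = scraper.fetch("demo.pdf", metadata={"tag": "x"})
```

Since the scraper only sees the client through `fetch`, it makes no assumption about where the bytes come from.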