Compare commits

...

15 Commits

Author SHA1 Message Date
67a3074ab4 using doc-forge (#1)
Reviewed-on: #1
Co-authored-by: Vishesh 'ironeagle' Bangotra <aetoskia@gmail.com>
Co-committed-by: Vishesh 'ironeagle' Bangotra <aetoskia@gmail.com>
2026-01-22 11:27:56 +00:00
6808538485 added .drone.yml
All checks were successful
continuous-integration/drone Build is passing
2026-01-09 15:55:54 +05:30
fc29f49d41 python file to generate docs. useful for pycharm on windows 2026-01-09 15:52:27 +05:30
3d6655084f added doc packages in requirements.txt 2026-01-09 15:52:14 +05:30
5af411020c docs: add mkdocs configuration and API documentation structure
- docs(mkdocs): add mkdocs.yml with material theme and plugin configuration
- docs(mkdocs): configure navigation for core, html, and pdf modules
- docs(docs): add documentation root and homepage
- docs(docs): add core contracts documentation pages
- docs(docs): add html implementation documentation pages
- docs(docs): add pdf implementation documentation pages
- docs(docs): wire mkdocstrings directives for API reference rendering
2026-01-09 15:51:54 +05:30
7f1b0d9c10 docs: add contract-oriented docstrings across core, html, and pdf layers
- docs(core): document Content and ContentType canonical models
- docs(core): define BaseParser contract and parsing semantics
- docs(core): define BaseScraper contract and acquisition semantics
- docs(html): document HTML package purpose and scope
- docs(html): add HTMLParser base with DOM helpers and contracts
- docs(html): add HTTP-based HTMLScraper with content-type enforcement
- docs(pdf): document PDF package structure and public pipeline
- docs(pdf): add BasePDFClient abstraction and filesystem implementation
- docs(pdf): add PDFParser base contract for binary parsing
- docs(pdf): add PDFScraper coordinating client and Content normalization
- docs(api): expand top-level omniread module with install instructions and examples
2026-01-09 15:51:22 +05:30
b2173f3ef0 refactor(tests): use omniread public API instead of internal module imports
- Replace deep imports with top-level omniread exports in tests
- Ensure tests validate only the supported public API surface
- Align HTML and PDF tests with documented library usage
2026-01-02 19:02:20 +05:30
de67c7b0b1 feat(pdf): add PDF client, scraper, parser, and end-to-end tests
- Introduce PDF submodule with client, scraper, and generic parser
- Add filesystem PDF client and test-only mock routing
- Add end-to-end PDF scrape → parse tests with typed output
- Mirror HTML module architecture for consistency
- Expose PDF primitives via omniread public API
2026-01-02 18:59:36 +05:30
390eb22e1b moved html mocks to html sub folder and updated conftest.py to read from new location with better path and endpoint handling 2026-01-02 18:44:26 +05:30
358abc9b36 feat(api): expose core and html primitives via top-level package exports
- Re-export Content and ContentType from omniread.core
- Re-export HTMLScraper and HTMLParser from omniread.html
- Define explicit __all__ for stable public API surface
2026-01-02 18:36:29 +05:30
07293e4651 feat(testing): add end-to-end HTML scraping and parsing tests with typed parsers
- Add smart httpx MockTransport routing based on endpoint paths
- Render HTML fixtures via Jinja templates populated from JSON data
- Introduce explicit, typed HTML parsers for semantic and table-based content
- Add end-to-end tests covering scraper → content → parser → Pydantic models
- Enforce explicit output contracts and avoid default dict-based parsing
2026-01-02 18:31:34 +05:30
fa14a79ec9 simple test case 2026-01-02 18:20:03 +05:30
55245cf241 added validation for content type 2026-01-02 18:19:47 +05:30
202329e190 refactor(html-scraper): normalize Content-Type and inject httpx client
- Inject httpx.Client for testability and reuse
- Validate and normalize Content-Type header before returning Content
- Emit ContentType.HTML instead of raw header strings
- Avoid per-request client creation
- Preserve metadata while allowing caller overrides
2026-01-02 18:08:46 +05:30
f59024ddd5 added pydantic 2026-01-02 18:08:37 +05:30
65 changed files with 5677 additions and 22 deletions

129
.drone.yml Normal file
@@ -0,0 +1,129 @@
---
kind: pipeline
type: docker
name: build-and-publish-pypi
platform:
  os: linux
  arch: arm64
workspace:
  path: /drone/src
steps:
  - name: check-version
    image: curlimages/curl:latest
    environment:
      PIP_REPO_URL:
        from_secret: PIP_REPO_URL
      PIP_USERNAME:
        from_secret: PIP_USERNAME
      PIP_PASSWORD:
        from_secret: PIP_PASSWORD
    commands:
      - PACKAGE_NAME=$(grep -E '^name\s*=' pyproject.toml | head -1 | cut -d'"' -f2)
      - VERSION=$(grep -E '^version\s*=' pyproject.toml | head -1 | cut -d'"' -f2)
      - echo "🔍 Checking if $PACKAGE_NAME==$VERSION exists on $PIP_REPO_URL ..."
      - |
        if curl -fsSL -u "$PIP_USERNAME:$PIP_PASSWORD" "$PIP_REPO_URL/simple/$PACKAGE_NAME/" | grep -q "$VERSION"; then
          echo "✅ $PACKAGE_NAME==$VERSION already exists — skipping build."
          exit 78
        else
          echo "🆕 New version detected: $PACKAGE_NAME==$VERSION"
        fi
  - name: build-package
    image: python:3.13-slim
    commands:
      - pip install --upgrade pip build
      - echo "📦 Building Python package..."
      - python -m build
      - ls -l dist
  - name: upload-to-private-pypi
    image: python:3.13-slim
    environment:
      PIP_REPO_URL:
        from_secret: PIP_REPO_URL
      PIP_USERNAME:
        from_secret: PIP_USERNAME
      PIP_PASSWORD:
        from_secret: PIP_PASSWORD
    commands:
      - pip install --upgrade twine
      - echo "🚀 Uploading to private PyPI at $PIP_REPO_URL ..."
      - |
        twine upload \
          --repository-url "$PIP_REPO_URL" \
          -u "$PIP_USERNAME" \
          -p "$PIP_PASSWORD" \
          dist/*
trigger:
  event:
    - tag
---
kind: pipeline
type: docker
name: backfill-pypi-from-tags
platform:
  os: linux
  arch: arm64
workspace:
  path: /drone/src
steps:
  - name: fetch-tags
    image: alpine/git
    commands:
      - git fetch --tags --force
  - name: build-and-upload-missing
    image: python:3.13-slim
    environment:
      PIP_REPO_URL:
        from_secret: PIP_REPO_URL
      PIP_USERNAME:
        from_secret: PIP_USERNAME
      PIP_PASSWORD:
        from_secret: PIP_PASSWORD
    commands:
      - apt-get update
      - apt-get install -y git curl ca-certificates
      - pip install --upgrade pip build twine
      - |
        set -e
        PACKAGE_NAME=$(grep -E '^name\s*=' pyproject.toml | cut -d'"' -f2)
        echo "📦 Package: $PACKAGE_NAME"
        for TAG in $(git tag --sort=version:refname); do
          VERSION="$TAG"
          echo "🔁 Version: $VERSION"
          if curl -fsSL -u "$PIP_USERNAME:$PIP_PASSWORD" \
            "$PIP_REPO_URL/simple/$PACKAGE_NAME/" | grep -q "$VERSION"; then
            echo "⏭️ Exists, skipping"
            continue
          fi
          git checkout --force "$TAG"
          echo "🏗️ Building $VERSION"
          rm -rf dist
          python -m build
          echo "⬆️ Uploading $VERSION"
          twine upload \
            --repository-url "$PIP_REPO_URL" \
            -u "$PIP_USERNAME" \
            -p "$PIP_PASSWORD" \
            dist/*
        done
trigger:
  event:
    - custom
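
Both pipelines decide whether a release already exists by running `grep -q "$VERSION"` over the package's `/simple/` index page, which is a substring match. The sketch below is not part of this change set. It shows one way an exact-version check could look, assuming the index lists files under the usual sdist and wheel naming scheme (`name-version.tar.gz`, `name-version-...whl`). The helper name `version_in_index` and the sample HTML are hypothetical.

```python
import re


def version_in_index(index_html: str, package: str, version: str) -> bool:
    """Return True if a file for exactly `version` of `package` is listed.

    Hypothetical helper: it assumes conventional sdist/wheel file names,
    where the version is followed by "-", ".tar.gz", or ".zip".
    """
    # Treat "-", "_" and "." in the project name as interchangeable,
    # in the spirit of simple-index name normalization.
    normalized = re.sub(r"[-_.]+", "-", package).lower()
    name_pattern = "[-_.]+".join(re.escape(part) for part in normalized.split("-"))
    pattern = re.compile(
        rf"(?<![\w.-]){name_pattern}-{re.escape(version)}(?=-|\.tar\.gz|\.zip)",
        re.IGNORECASE,
    )
    return pattern.search(index_html) is not None


# Illustrative index fragment (made up for this example).
sample = '<a href="omniread-0.1.10-py3-none-any.whl">omniread-0.1.10-py3-none-any.whl</a>'

substring_hit = "0.1.1" in sample  # what a plain substring search sees
exact_hit = version_in_index(sample, "omniread", "0.1.1")
print(substring_hit, exact_hit)
```

With this sample, the substring test reports a hit for `0.1.1` because `0.1.10` contains it, while the anchored pattern does not. Whether that difference matters depends on how versions are numbered in practice.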

1
.gitignore vendored
@@ -38,3 +38,4 @@ Thumbs.db
*.swo
*~
*.tmp
site

16
docforge.nav.yml Normal file
@@ -0,0 +1,16 @@
home: omniread/index.md
groups:
  Core API:
    - omniread/core/index.md
    - omniread/core/content.md
    - omniread/core/parser.md
    - omniread/core/scraper.md
  HTML Handling:
    - omniread/html/index.md
    - omniread/html/parser.md
    - omniread/html/scraper.md
  PDF Handling:
    - omniread/pdf/index.md
    - omniread/pdf/client.md
    - omniread/pdf/parser.md
    - omniread/pdf/scraper.md
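
The navigation paths above line up with the `:::` directives in the generated pages: `omniread/core/content.md` carries `::: omniread.core.content`, and `omniread/index.md` carries `::: omniread`. A minimal sketch of that mapping follows. It assumes the convention holds for every page, and the function name `page_to_module` is hypothetical rather than anything doc-forge is known to expose.

```python
from pathlib import PurePosixPath

# The nav file from this diff, written out as a plain Python mapping.
NAV = {
    "home": "omniread/index.md",
    "groups": {
        "Core API": [
            "omniread/core/index.md",
            "omniread/core/content.md",
            "omniread/core/parser.md",
            "omniread/core/scraper.md",
        ],
        "HTML Handling": [
            "omniread/html/index.md",
            "omniread/html/parser.md",
            "omniread/html/scraper.md",
        ],
        "PDF Handling": [
            "omniread/pdf/index.md",
            "omniread/pdf/client.md",
            "omniread/pdf/parser.md",
            "omniread/pdf/scraper.md",
        ],
    },
}


def page_to_module(page: str) -> str:
    """Map a docs page path to the dotted module path it documents."""
    parts = PurePosixPath(page).with_suffix("").parts
    # An index page documents the package itself, not a module named "index".
    if parts[-1] == "index":
        parts = parts[:-1]
    return ".".join(parts)


for group, pages in NAV["groups"].items():
    for page in pages:
        print(f"{group}: {page} -> ::: {page_to_module(page)}")
```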

@@ -0,0 +1,3 @@
# Content
::: omniread.core.content

@@ -0,0 +1,3 @@
# Core
::: omniread.core

@@ -0,0 +1,3 @@
# Parser
::: omniread.core.parser

@@ -0,0 +1,3 @@
# Scraper
::: omniread.core.scraper

@@ -0,0 +1,3 @@
# Html
::: omniread.html

@@ -0,0 +1,3 @@
# Parser
::: omniread.html.parser

@@ -0,0 +1,3 @@
# Scraper
::: omniread.html.scraper

3
docs/omniread/index.md Normal file
@@ -0,0 +1,3 @@
# Omniread
::: omniread

@@ -0,0 +1,3 @@
# Client
::: omniread.pdf.client

@@ -0,0 +1,3 @@
# Pdf
::: omniread.pdf

@@ -0,0 +1,3 @@
# Parser
::: omniread.pdf.parser

@@ -0,0 +1,3 @@
# Scraper
::: omniread.pdf.scraper

6
mcp_docs/index.json Normal file
@@ -0,0 +1,6 @@
{
"project": "omniread",
"type": "docforge-model",
"modules_count": 12,
"source": "docforge"
}

@@ -0,0 +1,118 @@
{
"module": "omniread.core.content",
"content": {
"path": "omniread.core.content",
"docstring": "Canonical content models for OmniRead.\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.",
"objects": {
"Enum": {
"name": "Enum",
"kind": "alias",
"path": "omniread.core.content.Enum",
"signature": "<bound method Alias.signature of Alias('Enum', 'enum.Enum')>",
"docstring": null
},
"dataclass": {
"name": "dataclass",
"kind": "alias",
"path": "omniread.core.content.dataclass",
"signature": "<bound method Alias.signature of Alias('dataclass', 'dataclasses.dataclass')>",
"docstring": null
},
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.core.content.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Mapping": {
"name": "Mapping",
"kind": "alias",
"path": "omniread.core.content.Mapping",
"signature": "<bound method Alias.signature of Alias('Mapping', 'typing.Mapping')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.core.content.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.core.content.ContentType",
"signature": "<bound method Class.signature of Class('ContentType', 17, 36)>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.core.content.ContentType.HTML",
"signature": null,
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.core.content.ContentType.PDF",
"signature": null,
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.core.content.ContentType.JSON",
"signature": null,
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.core.content.ContentType.XML",
"signature": null,
"docstring": "XML document content."
}
}
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.core.content.Content",
"signature": "<bound method Class.signature of Class('Content', 39, 63)>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.core.content.Content.raw",
"signature": null,
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.core.content.Content.source",
"signature": null,
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.core.content.Content.content_type",
"signature": null,
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.core.content.Content.metadata",
"signature": null,
"docstring": null
}
}
}
}
}
}
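
The model dump above records only names and docstrings for `ContentType` and `Content`. As a reading aid, here is a minimal sketch of what those two definitions might look like, based on the documented attributes and the imports listed in the dump (`Enum`, `dataclass`, `Any`, `Mapping`, `Optional`). The enum values, field types, and defaults are assumptions, not taken from the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(Enum):
    """Supported MIME types (member names from the dump; values assumed)."""

    HTML = "text/html"
    PDF = "application/pdf"
    JSON = "application/json"
    XML = "application/xml"


@dataclass
class Content:
    """Raw payload plus minimal metadata (field types and defaults assumed)."""

    raw: bytes                                     # bytes as retrieved from the source
    source: str                                    # URL, file path, or logical name
    content_type: Optional[ContentType] = None     # MIME type, if known
    metadata: Optional[Mapping[str, Any]] = None   # implementation-defined extras


# Example: a scraper would hand something like this to a parser.
page = Content(
    raw=b"<html><body>Hello</body></html>",
    source="https://example.com/",
    content_type=ContentType.HTML,
    metadata={"encoding": "utf-8"},
)
print(page.content_type, len(page.raw))
```

Keeping `raw` as bytes and leaving interpretation to parsers matches the docstring's split between *what* was extracted and *how* it was retrieved or parsed.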

@@ -0,0 +1,513 @@
{
"module": "omniread.core",
"content": {
"path": "omniread.core",
"docstring": "Core domain contracts for OmniRead.\n\nThis package defines the **format-agnostic domain layer** of OmniRead.\nIt exposes canonical content models and abstract interfaces that are\nimplemented by format-specific modules (HTML, PDF, etc.).\n\nPublic exports from this package are considered **stable contracts** and\nare safe for downstream consumers to depend on.\n\nSubmodules:\n- content: Canonical content models and enums\n- parser: Abstract parsing contracts\n- scraper: Abstract scraping contracts\n\nFormat-specific behavior must not be introduced at this layer.",
"objects": {
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.core.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.core.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.core.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.core.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.core.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.core.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.core.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.core.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.core.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.core.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"BaseParser": {
"name": "BaseParser",
"kind": "class",
"path": "omniread.core.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nA parser is a self-contained object that owns the Content\nit is responsible for interpreting.\n\nImplementations must:\n- Declare supported content types via `supported_types`\n- Raise parsing-specific exceptions from `parse()`\n- Remain deterministic for a given input\n\nConsumers may rely on:\n- Early validation of content compatibility\n- Type-stable return values from `parse()`",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.core.BaseParser.supported_types",
"signature": "<bound method Alias.signature of Alias('supported_types', 'omniread.core.parser.BaseParser.supported_types')>",
"docstring": "Set of content types supported by this parser.\n\nAn empty set indicates that the parser is content-type agnostic."
},
"content": {
"name": "content",
"kind": "attribute",
"path": "omniread.core.BaseParser.content",
"signature": "<bound method Alias.signature of Alias('content', 'omniread.core.parser.BaseParser.content')>",
"docstring": null
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.core.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nImplementations must fully consume the provided content and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed, structured representation.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
},
"supports": {
"name": "supports",
"kind": "function",
"path": "omniread.core.BaseParser.supports",
"signature": "<bound method Alias.signature of Alias('supports', 'omniread.core.parser.BaseParser.supports')>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n True if the content type is supported; False otherwise."
}
}
},
"BaseScraper": {
"name": "BaseScraper",
"kind": "class",
"path": "omniread.core.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nA scraper is responsible ONLY for fetching raw content\n(bytes) from a source. It must not interpret or parse it.\n\nA scraper is a **stateless acquisition component** that retrieves raw\ncontent from a source and returns it as a `Content` object.\n\nScrapers define *how content is obtained*, not *what the content means*.\n\nImplementations may vary in:\n- Transport mechanism (HTTP, filesystem, cloud storage)\n- Authentication strategy\n- Retry and backoff behavior\n\nImplementations must not:\n- Parse content\n- Modify content semantics\n- Couple scraping logic to a specific parser",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.core.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nImplementations must retrieve the content referenced by `source`\nand return it as raw bytes wrapped in a `Content` object.\n\nArgs:\n source: Location identifier (URL, file path, S3 URI, etc.)\n metadata: Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content object containing raw bytes and metadata.\n - Raw content bytes\n - Source identifier\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors as defined by the implementation."
}
}
},
"content": {
"name": "content",
"kind": "module",
"path": "omniread.core.content",
"signature": null,
"docstring": "Canonical content models for OmniRead.\n\nThis module defines the **format-agnostic content representation** used across\nall parsers and scrapers in OmniRead.\n\nThe models defined here represent *what* was extracted, not *how* it was\nretrieved or parsed. Format-specific behavior and metadata must not alter\nthe semantic meaning of these models.",
"members": {
"Enum": {
"name": "Enum",
"kind": "alias",
"path": "omniread.core.content.Enum",
"signature": "<bound method Alias.signature of Alias('Enum', 'enum.Enum')>",
"docstring": null
},
"dataclass": {
"name": "dataclass",
"kind": "alias",
"path": "omniread.core.content.dataclass",
"signature": "<bound method Alias.signature of Alias('dataclass', 'dataclasses.dataclass')>",
"docstring": null
},
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.core.content.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Mapping": {
"name": "Mapping",
"kind": "alias",
"path": "omniread.core.content.Mapping",
"signature": "<bound method Alias.signature of Alias('Mapping', 'typing.Mapping')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.core.content.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.core.content.ContentType",
"signature": "<bound method Class.signature of Class('ContentType', 17, 36)>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.core.content.ContentType.HTML",
"signature": null,
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.core.content.ContentType.PDF",
"signature": null,
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.core.content.ContentType.JSON",
"signature": null,
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.core.content.ContentType.XML",
"signature": null,
"docstring": "XML document content."
}
}
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.core.content.Content",
"signature": "<bound method Class.signature of Class('Content', 39, 63)>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.core.content.Content.raw",
"signature": null,
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.core.content.Content.source",
"signature": null,
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.core.content.Content.content_type",
"signature": null,
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.core.content.Content.metadata",
"signature": null,
"docstring": null
}
}
}
}
},
"parser": {
"name": "parser",
"kind": "module",
"path": "omniread.core.parser",
"signature": null,
"docstring": "Abstract parsing contracts for OmniRead.\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources",
"members": {
"ABC": {
"name": "ABC",
"kind": "alias",
"path": "omniread.core.parser.ABC",
"signature": "<bound method Alias.signature of Alias('ABC', 'abc.ABC')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.core.parser.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"Generic": {
"name": "Generic",
"kind": "alias",
"path": "omniread.core.parser.Generic",
"signature": "<bound method Alias.signature of Alias('Generic', 'typing.Generic')>",
"docstring": null
},
"TypeVar": {
"name": "TypeVar",
"kind": "alias",
"path": "omniread.core.parser.TypeVar",
"signature": "<bound method Alias.signature of Alias('TypeVar', 'typing.TypeVar')>",
"docstring": null
},
"Set": {
"name": "Set",
"kind": "alias",
"path": "omniread.core.parser.Set",
"signature": "<bound method Alias.signature of Alias('Set', 'typing.Set')>",
"docstring": null
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.core.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.core.parser.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.core.parser.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.core.parser.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.core.parser.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.core.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.core.parser.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.core.parser.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.core.parser.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.core.parser.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"T": {
"name": "T",
"kind": "attribute",
"path": "omniread.core.parser.T",
"signature": null,
"docstring": null
},
"BaseParser": {
"name": "BaseParser",
"kind": "class",
"path": "omniread.core.parser.BaseParser",
"signature": "<bound method Class.signature of Class('BaseParser', 26, 98)>",
"docstring": "Base interface for all parsers.\n\nA parser is a self-contained object that owns the Content\nit is responsible for interpreting.\n\nImplementations must:\n- Declare supported content types via `supported_types`\n- Raise parsing-specific exceptions from `parse()`\n- Remain deterministic for a given input\n\nConsumers may rely on:\n- Early validation of content compatibility\n- Type-stable return values from `parse()`",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.core.parser.BaseParser.supported_types",
"signature": null,
"docstring": "Set of content types supported by this parser.\n\nAn empty set indicates that the parser is content-type agnostic."
},
"content": {
"name": "content",
"kind": "attribute",
"path": "omniread.core.parser.BaseParser.content",
"signature": null,
"docstring": null
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.core.parser.BaseParser.parse",
"signature": "<bound method Function.signature of Function('parse', 68, 82)>",
"docstring": "Parse the owned content into structured output.\n\nImplementations must fully consume the provided content and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed, structured representation.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
},
"supports": {
"name": "supports",
"kind": "function",
"path": "omniread.core.parser.BaseParser.supports",
"signature": "<bound method Function.signature of Function('supports', 84, 98)>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n True if the content type is supported; False otherwise."
}
}
}
}
},
"scraper": {
"name": "scraper",
"kind": "module",
"path": "omniread.core.scraper",
"signature": null,
"docstring": "Abstract scraping contracts for OmniRead.\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.",
"members": {
"ABC": {
"name": "ABC",
"kind": "alias",
"path": "omniread.core.scraper.ABC",
"signature": "<bound method Alias.signature of Alias('ABC', 'abc.ABC')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.core.scraper.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.core.scraper.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Mapping": {
"name": "Mapping",
"kind": "alias",
"path": "omniread.core.scraper.Mapping",
"signature": "<bound method Alias.signature of Alias('Mapping', 'typing.Mapping')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.core.scraper.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.core.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.core.scraper.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.core.scraper.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.core.scraper.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.core.scraper.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"BaseScraper": {
"name": "BaseScraper",
"kind": "class",
"path": "omniread.core.scraper.BaseScraper",
"signature": "<bound method Class.signature of Class('BaseScraper', 26, 75)>",
"docstring": "Base interface for all scrapers.\n\nA scraper is responsible ONLY for fetching raw content\n(bytes) from a source. It must not interpret or parse it.\n\nA scraper is a **stateless acquisition component** that retrieves raw\ncontent from a source and returns it as a `Content` object.\n\nScrapers define *how content is obtained*, not *what the content means*.\n\nImplementations may vary in:\n- Transport mechanism (HTTP, filesystem, cloud storage)\n- Authentication strategy\n- Retry and backoff behavior\n\nImplementations must not:\n- Parse content\n- Modify content semantics\n- Couple scraping logic to a specific parser",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.core.scraper.BaseScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 49, 75)>",
"docstring": "Fetch raw content from the given source.\n\nImplementations must retrieve the content referenced by `source`\nand return it as raw bytes wrapped in a `Content` object.\n\nArgs:\n source: Location identifier (URL, file path, S3 URI, etc.)\n metadata: Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content object containing raw bytes and metadata.\n - Raw content bytes\n - Source identifier\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors as defined by the implementation."
}
}
}
}
}
}
}
}

@@ -0,0 +1,162 @@
{
"module": "omniread.core.parser",
"content": {
"path": "omniread.core.parser",
"docstring": "Abstract parsing contracts for OmniRead.\n\nThis module defines the **format-agnostic parser interface** used to transform\nraw content into structured, typed representations.\n\nParsers are responsible for:\n- Interpreting a single `Content` instance\n- Validating compatibility with the content type\n- Producing a structured output suitable for downstream consumers\n\nParsers are not responsible for:\n- Fetching or acquiring content\n- Performing retries or error recovery\n- Managing multiple content sources",
"objects": {
"ABC": {
"name": "ABC",
"kind": "alias",
"path": "omniread.core.parser.ABC",
"signature": "<bound method Alias.signature of Alias('ABC', 'abc.ABC')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.core.parser.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"Generic": {
"name": "Generic",
"kind": "alias",
"path": "omniread.core.parser.Generic",
"signature": "<bound method Alias.signature of Alias('Generic', 'typing.Generic')>",
"docstring": null
},
"TypeVar": {
"name": "TypeVar",
"kind": "alias",
"path": "omniread.core.parser.TypeVar",
"signature": "<bound method Alias.signature of Alias('TypeVar', 'typing.TypeVar')>",
"docstring": null
},
"Set": {
"name": "Set",
"kind": "alias",
"path": "omniread.core.parser.Set",
"signature": "<bound method Alias.signature of Alias('Set', 'typing.Set')>",
"docstring": null
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.core.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.core.parser.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.core.parser.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.core.parser.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.core.parser.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.core.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.core.parser.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.core.parser.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.core.parser.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.core.parser.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"T": {
"name": "T",
"kind": "attribute",
"path": "omniread.core.parser.T",
"signature": null,
"docstring": null
},
"BaseParser": {
"name": "BaseParser",
"kind": "class",
"path": "omniread.core.parser.BaseParser",
"signature": "<bound method Class.signature of Class('BaseParser', 26, 98)>",
"docstring": "Base interface for all parsers.\n\nA parser is a self-contained object that owns the Content\nit is responsible for interpreting.\n\nImplementations must:\n- Declare supported content types via `supported_types`\n- Raise parsing-specific exceptions from `parse()`\n- Remain deterministic for a given input\n\nConsumers may rely on:\n- Early validation of content compatibility\n- Type-stable return values from `parse()`",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.core.parser.BaseParser.supported_types",
"signature": null,
"docstring": "Set of content types supported by this parser.\n\nAn empty set indicates that the parser is content-type agnostic."
},
"content": {
"name": "content",
"kind": "attribute",
"path": "omniread.core.parser.BaseParser.content",
"signature": null,
"docstring": null
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.core.parser.BaseParser.parse",
"signature": "<bound method Function.signature of Function('parse', 68, 82)>",
"docstring": "Parse the owned content into structured output.\n\nImplementations must fully consume the provided content and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed, structured representation.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
},
"supports": {
"name": "supports",
"kind": "function",
"path": "omniread.core.parser.BaseParser.supports",
"signature": "<bound method Function.signature of Function('supports', 84, 98)>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n True if the content type is supported; False otherwise."
}
}
}
}
}
}
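
The parser contract described in the `omniread.core.parser` dump above can be illustrated with a minimal, self-contained sketch. It does not import OmniRead: `Content`, `ContentType`, and `BaseParser` are stand-ins rebuilt from the docstrings. The MIME string values, the constructor signature, the `ValueError` raised on a mismatch, and the `JSONKeysParser` subclass are all assumptions made for illustration; the real library may differ.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Generic, List, Mapping, Optional, Set, TypeVar


class ContentType(str, Enum):
    # Member names come from the dump; the string values are assumed.
    HTML = "text/html"
    PDF = "application/pdf"
    JSON = "application/json"
    XML = "application/xml"


@dataclass(frozen=True)
class Content:
    # Field names follow the documented attributes.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


T = TypeVar("T")


class BaseParser(ABC, Generic[T]):
    # An empty set means the parser is content-type agnostic.
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        # Assumed: the parser owns its content and validates it early.
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    @abstractmethod
    def parse(self) -> T:
        """Return a deterministic, structured representation."""

    def supports(self) -> bool:
        if not self.supported_types:
            return True
        return self.content.content_type in self.supported_types


class JSONKeysParser(BaseParser[List[str]]):
    # Hypothetical concrete parser: returns the sorted top-level keys.
    supported_types = {ContentType.JSON}

    def parse(self) -> List[str]:
        return sorted(json.loads(self.content.raw.decode("utf-8")))


content = Content(raw=b'{"b": 1, "a": 2}', source="inline", content_type=ContentType.JSON)
keys = JSONKeysParser(content).parse()
```

In this sketch the compatibility check runs in the constructor, which is one way to give consumers the "early validation" the docstring mentions; a parser with an empty `supported_types` would accept any content.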

@@ -0,0 +1,97 @@
{
"module": "omniread.core.scraper",
"content": {
"path": "omniread.core.scraper",
"docstring": "Abstract scraping contracts for OmniRead.\n\nThis module defines the **format-agnostic scraper interface** responsible for\nacquiring raw content from external sources.\n\nScrapers are responsible for:\n- Locating and retrieving raw content bytes\n- Attaching minimal contextual metadata\n- Returning normalized `Content` objects\n\nScrapers are explicitly NOT responsible for:\n- Parsing or interpreting content\n- Inferring structure or semantics\n- Performing content-type specific processing\n\nAll interpretation must be delegated to parsers.",
"objects": {
"ABC": {
"name": "ABC",
"kind": "alias",
"path": "omniread.core.scraper.ABC",
"signature": "<bound method Alias.signature of Alias('ABC', 'abc.ABC')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.core.scraper.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.core.scraper.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Mapping": {
"name": "Mapping",
"kind": "alias",
"path": "omniread.core.scraper.Mapping",
"signature": "<bound method Alias.signature of Alias('Mapping', 'typing.Mapping')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.core.scraper.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.core.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.core.scraper.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.core.scraper.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.core.scraper.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.core.scraper.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"BaseScraper": {
"name": "BaseScraper",
"kind": "class",
"path": "omniread.core.scraper.BaseScraper",
"signature": "<bound method Class.signature of Class('BaseScraper', 26, 75)>",
"docstring": "Base interface for all scrapers.\n\nA scraper is responsible ONLY for fetching raw content\n(bytes) from a source. It must not interpret or parse it.\n\nA scraper is a **stateless acquisition component** that retrieves raw\ncontent from a source and returns it as a `Content` object.\n\nScrapers define *how content is obtained*, not *what the content means*.\n\nImplementations may vary in:\n- Transport mechanism (HTTP, filesystem, cloud storage)\n- Authentication strategy\n- Retry and backoff behavior\n\nImplementations must not:\n- Parse content\n- Modify content semantics\n- Couple scraping logic to a specific parser",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.core.scraper.BaseScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 49, 75)>",
"docstring": "Fetch raw content from the given source.\n\nImplementations must retrieve the content referenced by `source`\nand return it as raw bytes wrapped in a `Content` object.\n\nArgs:\n source: Location identifier (URL, file path, S3 URI, etc.)\n metadata: Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content object containing raw bytes and metadata.\n - Raw content bytes\n - Source identifier\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors as defined by the implementation."
}
}
}
}
}
}
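
The scraper contract in the `omniread.core.scraper` dump above can be sketched the same way. This is a self-contained illustration, not OmniRead's code: `Content` and `BaseScraper` are stand-ins rebuilt from the docstrings, the keyword-only `metadata` parameter is an assumed signature, and `FileScraper` and `demo` are hypothetical names. The scraper only reads bytes and attaches metadata; it never inspects the payload.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Content:
    # Field names follow the documented attributes.
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseScraper(ABC):
    @abstractmethod
    def fetch(
        self,
        source: str,
        *,
        metadata: Optional[Mapping[str, Any]] = None,  # assumed signature
    ) -> Content:
        """Retrieve raw bytes for `source` and wrap them in a Content."""


class FileScraper(BaseScraper):
    # Hypothetical filesystem transport: reads bytes, never interprets them.
    def fetch(
        self,
        source: str,
        *,
        metadata: Optional[Mapping[str, Any]] = None,
    ) -> Content:
        raw = Path(source).read_bytes()
        return Content(raw=raw, source=source, metadata=dict(metadata or {}))


def demo() -> Content:
    # Write a throwaway file, then fetch it through the scraper.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_bytes(b"hello")
        return FileScraper().fetch(str(path), metadata={"note": "demo"})


result = demo()
```

Because `FileScraper` leaves `content_type` unset, routing to a parser would have to be decided elsewhere, which matches the documented rule that scrapers do not infer structure or semantics.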

@@ -0,0 +1,488 @@
{
"module": "omniread.html",
"content": {
"path": "omniread.html",
"docstring": "HTML format implementation for OmniRead.\n\nThis package provides **HTML-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nIt includes:\n- HTML parsers that interpret HTML content\n- HTML scrapers that retrieve HTML documents\n\nThis package:\n- Implements, but does not redefine, core contracts\n- May contain HTML-specific behavior and edge-case handling\n- Produces canonical content models defined in `omniread.core.content`\n\nConsumers should depend on `omniread.core` interfaces wherever possible and\nuse this package only when HTML-specific behavior is required.",
"objects": {
"HTMLScraper": {
"name": "HTMLScraper",
"kind": "class",
"path": "omniread.html.HTMLScraper",
"signature": "<bound method Alias.signature of Alias('HTMLScraper', 'omniread.html.scraper.HTMLScraper')>",
"docstring": "Base HTML scraper using httpx.\n\nThis scraper retrieves HTML documents over HTTP(S) and returns them\nas raw content wrapped in a `Content` object.\n\nFetches raw bytes and metadata only.\nThe scraper:\n- Uses `httpx.Client` for HTTP requests\n- Enforces an HTML content type\n- Preserves HTTP response metadata\n\nThe scraper does not:\n- Parse HTML\n- Perform retries or backoff\n- Handle non-HTML responses",
"members": {
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.html.HTMLScraper.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.html.scraper.HTMLScraper.content_type')>",
"docstring": null
},
"validate_content_type": {
"name": "validate_content_type",
"kind": "function",
"path": "omniread.html.HTMLScraper.validate_content_type",
"signature": "<bound method Alias.signature of Alias('validate_content_type', 'omniread.html.scraper.HTMLScraper.validate_content_type')>",
"docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response: HTTP response returned by `httpx`.\n\nRaises:\n ValueError: If the `Content-Type` header is missing or does not\n indicate HTML content."
},
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.html.HTMLScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.html.scraper.HTMLScraper.fetch')>",
"docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source: URL of the HTML document.\n metadata: Optional metadata to be merged into the returned content.\n\nReturns:\n A `Content` instance containing:\n - Raw HTML bytes\n - Source URL\n - HTML content type\n - HTTP response metadata\n\nRaises:\n httpx.HTTPError: If the HTTP request fails.\n ValueError: If the response is not valid HTML."
}
}
},
"HTMLParser": {
"name": "HTMLParser",
"kind": "class",
"path": "omniread.html.HTMLParser",
"signature": "<bound method Alias.signature of Alias('HTMLParser', 'omniread.html.parser.HTMLParser')>",
"docstring": "Base HTML parser.\n\nThis class extends the core `BaseParser` with HTML-specific behavior,\nincluding DOM parsing via BeautifulSoup and reusable extraction helpers.\n\nProvides reusable helpers for HTML extraction.\nConcrete parsers must explicitly define the return type.\n\nCharacteristics:\n- Accepts only HTML content\n- Owns a parsed BeautifulSoup DOM tree\n- Provides pure helper utilities for common HTML structures\n\nConcrete subclasses must:\n- Define the output type `T`\n- Implement the `parse()` method",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.html.HTMLParser.supported_types",
"signature": "<bound method Alias.signature of Alias('supported_types', 'omniread.html.parser.HTMLParser.supported_types')>",
"docstring": "Set of content types supported by this parser (HTML only)."
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.html.HTMLParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.html.parser.HTMLParser.parse')>",
"docstring": "Fully parse the HTML content into structured output.\n\nImplementations must fully interpret the HTML DOM and return\na deterministic, structured output.\n\nReturns:\n Parsed representation of type `T`."
},
"parse_div": {
"name": "parse_div",
"kind": "function",
"path": "omniread.html.HTMLParser.parse_div",
"signature": "<bound method Alias.signature of Alias('parse_div', 'omniread.html.parser.HTMLParser.parse_div')>",
"docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div: BeautifulSoup tag representing a `<div>`.\n separator: String used to separate text nodes.\n\nReturns:\n Flattened, whitespace-normalized text content."
},
"parse_link": {
"name": "parse_link",
"kind": "function",
"path": "omniread.html.HTMLParser.parse_link",
"signature": "<bound method Alias.signature of Alias('parse_link', 'omniread.html.parser.HTMLParser.parse_link')>",
"docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a: BeautifulSoup tag representing an anchor.\n\nReturns:\n The value of the `href` attribute, or None if absent."
},
"parse_table": {
"name": "parse_table",
"kind": "function",
"path": "omniread.html.HTMLParser.parse_table",
"signature": "<bound method Alias.signature of Alias('parse_table', 'omniread.html.parser.HTMLParser.parse_table')>",
"docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table: BeautifulSoup tag representing a `<table>`.\n\nReturns:\n A list of rows, where each row is a list of cell text values."
},
"parse_meta": {
"name": "parse_meta",
"kind": "function",
"path": "omniread.html.HTMLParser.parse_meta",
"signature": "<bound method Alias.signature of Alias('parse_meta', 'omniread.html.parser.HTMLParser.parse_meta')>",
"docstring": "Extract high-level metadata from the HTML document.\n\nThis includes:\n- Document title\n- `<meta>` tag name/property → content mappings\n\nReturns:\n Dictionary containing extracted metadata."
}
}
},
"parser": {
"name": "parser",
"kind": "module",
"path": "omniread.html.parser",
"signature": null,
"docstring": "HTML parser base implementations for OmniRead.\n\nThis module provides reusable HTML parsing utilities built on top of\nthe abstract parser contracts defined in `omniread.core.parser`.\n\nIt supplies:\n- Content-type enforcement for HTML inputs\n- BeautifulSoup initialization and lifecycle management\n- Common helper methods for extracting structured data from HTML elements\n\nConcrete parsers must subclass `HTMLParser` and implement the `parse()` method\nto return a structured representation appropriate for their use case.",
"members": {
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.html.parser.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Generic": {
"name": "Generic",
"kind": "alias",
"path": "omniread.html.parser.Generic",
"signature": "<bound method Alias.signature of Alias('Generic', 'typing.Generic')>",
"docstring": null
},
"TypeVar": {
"name": "TypeVar",
"kind": "alias",
"path": "omniread.html.parser.TypeVar",
"signature": "<bound method Alias.signature of Alias('TypeVar', 'typing.TypeVar')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.html.parser.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.html.parser.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"BeautifulSoup": {
"name": "BeautifulSoup",
"kind": "alias",
"path": "omniread.html.parser.BeautifulSoup",
"signature": "<bound method Alias.signature of Alias('BeautifulSoup', 'bs4.BeautifulSoup')>",
"docstring": null
},
"Tag": {
"name": "Tag",
"kind": "alias",
"path": "omniread.html.parser.Tag",
"signature": "<bound method Alias.signature of Alias('Tag', 'bs4.Tag')>",
"docstring": null
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.html.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.html.parser.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.html.parser.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.html.parser.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.html.parser.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.html.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.html.parser.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.html.parser.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.html.parser.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.html.parser.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"BaseParser": {
"name": "BaseParser",
"kind": "class",
"path": "omniread.html.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nA parser is a self-contained object that owns the Content\nit is responsible for interpreting.\n\nImplementations must:\n- Declare supported content types via `supported_types`\n- Raise parsing-specific exceptions from `parse()`\n- Remain deterministic for a given input\n\nConsumers may rely on:\n- Early validation of content compatibility\n- Type-stable return values from `parse()`",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.html.parser.BaseParser.supported_types",
"signature": "<bound method Alias.signature of Alias('supported_types', 'omniread.core.parser.BaseParser.supported_types')>",
"docstring": "Set of content types supported by this parser.\n\nAn empty set indicates that the parser is content-type agnostic."
},
"content": {
"name": "content",
"kind": "attribute",
"path": "omniread.html.parser.BaseParser.content",
"signature": "<bound method Alias.signature of Alias('content', 'omniread.core.parser.BaseParser.content')>",
"docstring": null
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.html.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nImplementations must fully consume the provided content and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed, structured representation.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
},
"supports": {
"name": "supports",
"kind": "function",
"path": "omniread.html.parser.BaseParser.supports",
"signature": "<bound method Alias.signature of Alias('supports', 'omniread.core.parser.BaseParser.supports')>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n True if the content type is supported; False otherwise."
}
}
},
"T": {
"name": "T",
"kind": "attribute",
"path": "omniread.html.parser.T",
"signature": null,
"docstring": null
},
"HTMLParser": {
"name": "HTMLParser",
"kind": "class",
"path": "omniread.html.parser.HTMLParser",
"signature": "<bound method Class.signature of Class('HTMLParser', 27, 177)>",
"docstring": "Base HTML parser.\n\nThis class extends the core `BaseParser` with HTML-specific behavior,\nincluding DOM parsing via BeautifulSoup and reusable extraction helpers.\n\nProvides reusable helpers for HTML extraction.\nConcrete parsers must explicitly define the return type.\n\nCharacteristics:\n- Accepts only HTML content\n- Owns a parsed BeautifulSoup DOM tree\n- Provides pure helper utilities for common HTML structures\n\nConcrete subclasses must:\n- Define the output type `T`\n- Implement the `parse()` method",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.html.parser.HTMLParser.supported_types",
"signature": null,
"docstring": "Set of content types supported by this parser (HTML only)."
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse",
"signature": "<bound method Function.signature of Function('parse', 70, 81)>",
"docstring": "Fully parse the HTML content into structured output.\n\nImplementations must fully interpret the HTML DOM and return\na deterministic, structured output.\n\nReturns:\n Parsed representation of type `T`."
},
"parse_div": {
"name": "parse_div",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_div",
"signature": "<bound method Function.signature of Function('parse_div', 87, 99)>",
"docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div: BeautifulSoup tag representing a `<div>`.\n separator: String used to separate text nodes.\n\nReturns:\n Flattened, whitespace-normalized text content."
},
"parse_link": {
"name": "parse_link",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_link",
"signature": "<bound method Function.signature of Function('parse_link', 101, 112)>",
"docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a: BeautifulSoup tag representing an anchor.\n\nReturns:\n The value of the `href` attribute, or None if absent."
},
"parse_table": {
"name": "parse_table",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_table",
"signature": "<bound method Function.signature of Function('parse_table', 114, 133)>",
"docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table: BeautifulSoup tag representing a `<table>`.\n\nReturns:\n A list of rows, where each row is a list of cell text values."
},
"parse_meta": {
"name": "parse_meta",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_meta",
"signature": "<bound method Function.signature of Function('parse_meta', 153, 177)>",
"docstring": "Extract high-level metadata from the HTML document.\n\nThis includes:\n- Document title\n- `<meta>` tag name/property → content mappings\n\nReturns:\n Dictionary containing extracted metadata."
}
}
},
"list": {
"name": "list",
"kind": "alias",
"path": "omniread.html.parser.list",
"signature": "<bound method Alias.signature of Alias('list', 'typing.list')>",
"docstring": null
},
"dict": {
"name": "dict",
"kind": "alias",
"path": "omniread.html.parser.dict",
"signature": "<bound method Alias.signature of Alias('dict', 'typing.dict')>",
"docstring": null
}
}
},
"scraper": {
"name": "scraper",
"kind": "module",
"path": "omniread.html.scraper",
"signature": null,
"docstring": "HTML scraping implementation for OmniRead.\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting",
"members": {
"httpx": {
"name": "httpx",
"kind": "alias",
"path": "omniread.html.scraper.httpx",
"signature": "<bound method Alias.signature of Alias('httpx', 'httpx')>",
"docstring": null
},
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.html.scraper.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Mapping": {
"name": "Mapping",
"kind": "alias",
"path": "omniread.html.scraper.Mapping",
"signature": "<bound method Alias.signature of Alias('Mapping', 'typing.Mapping')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.html.scraper.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.html.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.html.scraper.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.html.scraper.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.html.scraper.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.html.scraper.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.html.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.html.scraper.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.html.scraper.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.html.scraper.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.html.scraper.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"BaseScraper": {
"name": "BaseScraper",
"kind": "class",
"path": "omniread.html.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nA scraper is responsible ONLY for fetching raw content\n(bytes) from a source. It must not interpret or parse it.\n\nA scraper is a **stateless acquisition component** that retrieves raw\ncontent from a source and returns it as a `Content` object.\n\nScrapers define *how content is obtained*, not *what the content means*.\n\nImplementations may vary in:\n- Transport mechanism (HTTP, filesystem, cloud storage)\n- Authentication strategy\n- Retry and backoff behavior\n\nImplementations must not:\n- Parse content\n- Modify content semantics\n- Couple scraping logic to a specific parser",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.html.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nImplementations must retrieve the content referenced by `source`\nand return it as raw bytes wrapped in a `Content` object.\n\nArgs:\n source: Location identifier (URL, file path, S3 URI, etc.)\n metadata: Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content object containing raw bytes and metadata.\n - Raw content bytes\n - Source identifier\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors as defined by the implementation."
}
}
},
"HTMLScraper": {
"name": "HTMLScraper",
"kind": "class",
"path": "omniread.html.scraper.HTMLScraper",
"signature": "<bound method Class.signature of Class('HTMLScraper', 26, 134)>",
"docstring": "Base HTML scraper using httpx.\n\nThis scraper retrieves HTML documents over HTTP(S) and returns them\nas raw content wrapped in a `Content` object.\n\nFetches raw bytes and metadata only.\nThe scraper:\n- Uses `httpx.Client` for HTTP requests\n- Enforces an HTML content type\n- Preserves HTTP response metadata\n\nThe scraper does not:\n- Parse HTML\n- Perform retries or backoff\n- Handle non-HTML responses",
"members": {
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.html.scraper.HTMLScraper.content_type",
"signature": null,
"docstring": null
},
"validate_content_type": {
"name": "validate_content_type",
"kind": "function",
"path": "omniread.html.scraper.HTMLScraper.validate_content_type",
"signature": "<bound method Function.signature of Function('validate_content_type', 71, 94)>",
"docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response: HTTP response returned by `httpx`.\n\nRaises:\n ValueError: If the `Content-Type` header is missing or does not\n indicate HTML content."
},
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.html.scraper.HTMLScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 96, 134)>",
"docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source: URL of the HTML document.\n metadata: Optional metadata to be merged into the returned content.\n\nReturns:\n A `Content` instance containing:\n - Raw HTML bytes\n - Source URL\n - HTML content type\n - HTTP response metadata\n\nRaises:\n httpx.HTTPError: If the HTTP request fails.\n ValueError: If the response is not valid HTML."
}
}
}
}
}
}
}
}
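
The `Content` and `ContentType` docstrings above describe the exchange format shared by scrapers and parsers, and `BaseParser.supported_types` notes that an empty set means "content-type agnostic". The following is a minimal sketch of that contract, not the library's actual code: the MIME string values, the frozen dataclass, and the free function `supports` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, FrozenSet, Mapping, Optional


class ContentType(str, Enum):
    # Member names come from the docs; the MIME values are assumed.
    HTML = "text/html"
    PDF = "application/pdf"
    JSON = "application/json"
    XML = "application/xml"


@dataclass(frozen=True)  # frozen is an assumption, not documented behavior
class Content:
    raw: bytes                                     # raw bytes from the source
    source: str                                    # URL, file path, or logical name
    content_type: Optional[ContentType] = None     # MIME type, if known
    metadata: Optional[Mapping[str, Any]] = None   # implementation-defined extras


def supports(supported_types: FrozenSet[ContentType], content: Content) -> bool:
    """Hypothetical stand-in for the documented `BaseParser.supports` check."""
    if not supported_types:
        # An empty set means the parser accepts any content type.
        return True
    return content.content_type in supported_types


page = Content(
    raw=b"<html></html>",
    source="https://example.com",
    content_type=ContentType.HTML,
)
```

With these definitions, a parser that declares only `ContentType.HTML` would accept `page`, and one that declares only `ContentType.PDF` would reject it.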

@@ -0,0 +1,241 @@
{
"module": "omniread.html.parser",
"content": {
"path": "omniread.html.parser",
"docstring": "HTML parser base implementations for OmniRead.\n\nThis module provides reusable HTML parsing utilities built on top of\nthe abstract parser contracts defined in `omniread.core.parser`.\n\nIt supplies:\n- Content-type enforcement for HTML inputs\n- BeautifulSoup initialization and lifecycle management\n- Common helper methods for extracting structured data from HTML elements\n\nConcrete parsers must subclass `HTMLParser` and implement the `parse()` method\nto return a structured representation appropriate for their use case.",
"objects": {
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.html.parser.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Generic": {
"name": "Generic",
"kind": "alias",
"path": "omniread.html.parser.Generic",
"signature": "<bound method Alias.signature of Alias('Generic', 'typing.Generic')>",
"docstring": null
},
"TypeVar": {
"name": "TypeVar",
"kind": "alias",
"path": "omniread.html.parser.TypeVar",
"signature": "<bound method Alias.signature of Alias('TypeVar', 'typing.TypeVar')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.html.parser.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.html.parser.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"BeautifulSoup": {
"name": "BeautifulSoup",
"kind": "alias",
"path": "omniread.html.parser.BeautifulSoup",
"signature": "<bound method Alias.signature of Alias('BeautifulSoup', 'bs4.BeautifulSoup')>",
"docstring": null
},
"Tag": {
"name": "Tag",
"kind": "alias",
"path": "omniread.html.parser.Tag",
"signature": "<bound method Alias.signature of Alias('Tag', 'bs4.Tag')>",
"docstring": null
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.html.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.html.parser.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.html.parser.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.html.parser.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.html.parser.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.html.parser.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.html.parser.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.html.parser.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.html.parser.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.html.parser.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"BaseParser": {
"name": "BaseParser",
"kind": "class",
"path": "omniread.html.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nA parser is a self-contained object that owns the Content\nit is responsible for interpreting.\n\nImplementations must:\n- Declare supported content types via `supported_types`\n- Raise parsing-specific exceptions from `parse()`\n- Remain deterministic for a given input\n\nConsumers may rely on:\n- Early validation of content compatibility\n- Type-stable return values from `parse()`",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.html.parser.BaseParser.supported_types",
"signature": "<bound method Alias.signature of Alias('supported_types', 'omniread.core.parser.BaseParser.supported_types')>",
"docstring": "Set of content types supported by this parser.\n\nAn empty set indicates that the parser is content-type agnostic."
},
"content": {
"name": "content",
"kind": "attribute",
"path": "omniread.html.parser.BaseParser.content",
"signature": "<bound method Alias.signature of Alias('content', 'omniread.core.parser.BaseParser.content')>",
"docstring": null
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.html.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nImplementations must fully consume the provided content and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed, structured representation.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
},
"supports": {
"name": "supports",
"kind": "function",
"path": "omniread.html.parser.BaseParser.supports",
"signature": "<bound method Alias.signature of Alias('supports', 'omniread.core.parser.BaseParser.supports')>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n True if the content type is supported; False otherwise."
}
}
},
"T": {
"name": "T",
"kind": "attribute",
"path": "omniread.html.parser.T",
"signature": null,
"docstring": null
},
"HTMLParser": {
"name": "HTMLParser",
"kind": "class",
"path": "omniread.html.parser.HTMLParser",
"signature": "<bound method Class.signature of Class('HTMLParser', 27, 177)>",
"docstring": "Base HTML parser.\n\nThis class extends the core `BaseParser` with HTML-specific behavior,\nincluding DOM parsing via BeautifulSoup and reusable extraction helpers.\n\nProvides reusable helpers for HTML extraction.\nConcrete parsers must explicitly define the return type.\n\nCharacteristics:\n- Accepts only HTML content\n- Owns a parsed BeautifulSoup DOM tree\n- Provides pure helper utilities for common HTML structures\n\nConcrete subclasses must:\n- Define the output type `T`\n- Implement the `parse()` method",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.html.parser.HTMLParser.supported_types",
"signature": null,
"docstring": "Set of content types supported by this parser (HTML only)."
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse",
"signature": "<bound method Function.signature of Function('parse', 70, 81)>",
"docstring": "Fully parse the HTML content into structured output.\n\nImplementations must fully interpret the HTML DOM and return\na deterministic, structured output.\n\nReturns:\n Parsed representation of type `T`."
},
"parse_div": {
"name": "parse_div",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_div",
"signature": "<bound method Function.signature of Function('parse_div', 87, 99)>",
"docstring": "Extract normalized text from a `<div>` element.\n\nArgs:\n div: BeautifulSoup tag representing a `<div>`.\n separator: String used to separate text nodes.\n\nReturns:\n Flattened, whitespace-normalized text content."
},
"parse_link": {
"name": "parse_link",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_link",
"signature": "<bound method Function.signature of Function('parse_link', 101, 112)>",
"docstring": "Extract the hyperlink reference from an `<a>` element.\n\nArgs:\n a: BeautifulSoup tag representing an anchor.\n\nReturns:\n The value of the `href` attribute, or None if absent."
},
"parse_table": {
"name": "parse_table",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_table",
"signature": "<bound method Function.signature of Function('parse_table', 114, 133)>",
"docstring": "Parse an HTML table into a 2D list of strings.\n\nArgs:\n table: BeautifulSoup tag representing a `<table>`.\n\nReturns:\n A list of rows, where each row is a list of cell text values."
},
"parse_meta": {
"name": "parse_meta",
"kind": "function",
"path": "omniread.html.parser.HTMLParser.parse_meta",
"signature": "<bound method Function.signature of Function('parse_meta', 153, 177)>",
"docstring": "Extract high-level metadata from the HTML document.\n\nThis includes:\n- Document title\n- `<meta>` tag name/property → content mappings\n\nReturns:\n Dictionary containing extracted metadata."
}
}
},
"list": {
"name": "list",
"kind": "alias",
"path": "omniread.html.parser.list",
"signature": "<bound method Alias.signature of Alias('list', 'typing.list')>",
"docstring": null
},
"dict": {
"name": "dict",
"kind": "alias",
"path": "omniread.html.parser.dict",
"signature": "<bound method Alias.signature of Alias('dict', 'typing.dict')>",
"docstring": null
}
}
}
}
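
The `parse_table` docstring above promises "a list of rows, where each row is a list of cell text values". As a rough illustration of that output shape, here is a sketch that uses only the standard library's `html.parser` rather than BeautifulSoup. It is a swapped-in technique, not the `HTMLParser.parse_table` implementation: the real helper takes a BeautifulSoup `Tag`, and the class name `TableRows` and the whitespace handling here are assumptions.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import List, Optional


class TableRows(StdlibHTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text chunks of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so cell values compare cleanly.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def parse_table(html: str) -> List[List[str]]:
    """Return the table as a 2D list of cell strings."""
    collector = TableRows()
    collector.feed(html)
    collector.close()
    return collector.rows


rows = parse_table(
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td> Apples </td><td>3</td></tr></table>"
)
# rows == [["Name", "Qty"], ["Apples", "3"]]
```

Nested tables are not handled in this sketch; whether the documented helper handles them is not stated in the docstring.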

@@ -0,0 +1,157 @@
{
"module": "omniread.html.scraper",
"content": {
"path": "omniread.html.scraper",
"docstring": "HTML scraping implementation for OmniRead.\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting",
"objects": {
"httpx": {
"name": "httpx",
"kind": "alias",
"path": "omniread.html.scraper.httpx",
"signature": "<bound method Alias.signature of Alias('httpx', 'httpx')>",
"docstring": null
},
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.html.scraper.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Mapping": {
"name": "Mapping",
"kind": "alias",
"path": "omniread.html.scraper.Mapping",
"signature": "<bound method Alias.signature of Alias('Mapping', 'typing.Mapping')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.html.scraper.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.html.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.html.scraper.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.html.scraper.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.html.scraper.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.html.scraper.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.html.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.html.scraper.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.html.scraper.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.html.scraper.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.html.scraper.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"BaseScraper": {
"name": "BaseScraper",
"kind": "class",
"path": "omniread.html.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nA scraper is responsible ONLY for fetching raw content\n(bytes) from a source. It must not interpret or parse it.\n\nA scraper is a **stateless acquisition component** that retrieves raw\ncontent from a source and returns it as a `Content` object.\n\nScrapers define *how content is obtained*, not *what the content means*.\n\nImplementations may vary in:\n- Transport mechanism (HTTP, filesystem, cloud storage)\n- Authentication strategy\n- Retry and backoff behavior\n\nImplementations must not:\n- Parse content\n- Modify content semantics\n- Couple scraping logic to a specific parser",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.html.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nImplementations must retrieve the content referenced by `source`\nand return it as raw bytes wrapped in a `Content` object.\n\nArgs:\n source: Location identifier (URL, file path, S3 URI, etc.)\n metadata: Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content object containing raw bytes and metadata.\n - Raw content bytes\n - Source identifier\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors as defined by the implementation."
}
}
},
"HTMLScraper": {
"name": "HTMLScraper",
"kind": "class",
"path": "omniread.html.scraper.HTMLScraper",
"signature": "<bound method Class.signature of Class('HTMLScraper', 26, 134)>",
"docstring": "Base HTML scraper using httpx.\n\nThis scraper retrieves HTML documents over HTTP(S) and returns them\nas raw content wrapped in a `Content` object.\n\nFetches raw bytes and metadata only.\nThe scraper:\n- Uses `httpx.Client` for HTTP requests\n- Enforces an HTML content type\n- Preserves HTTP response metadata\n\nThe scraper does not:\n- Parse HTML\n- Perform retries or backoff\n- Handle non-HTML responses",
"members": {
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.html.scraper.HTMLScraper.content_type",
"signature": null,
"docstring": null
},
"validate_content_type": {
"name": "validate_content_type",
"kind": "function",
"path": "omniread.html.scraper.HTMLScraper.validate_content_type",
"signature": "<bound method Function.signature of Function('validate_content_type', 71, 94)>",
"docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response: HTTP response returned by `httpx`.\n\nRaises:\n ValueError: If the `Content-Type` header is missing or does not\n indicate HTML content."
},
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.html.scraper.HTMLScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 96, 134)>",
"docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source: URL of the HTML document.\n metadata: Optional metadata to be merged into the returned content.\n\nReturns:\n A `Content` instance containing:\n - Raw HTML bytes\n - Source URL\n - HTML content type\n - HTTP response metadata\n\nRaises:\n httpx.HTTPError: If the HTTP request fails.\n ValueError: If the response is not valid HTML."
}
}
}
}
}
}
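
`HTMLScraper.validate_content_type` is documented as raising `ValueError` when the `Content-Type` header is missing or does not indicate HTML. A minimal sketch of such a check follows, written against a plain header mapping so it runs without `httpx`. The function name, the exact accepted media type (`text/html` only), and the error messages are assumptions; the real method receives an `httpx` response and may accept other values.

```python
from typing import Mapping


def validate_html_content_type(headers: Mapping[str, str]) -> None:
    """Raise ValueError unless the headers declare an HTML media type."""
    # HTTP header names are case-insensitive; normalize for a plain dict.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("content-type")
    if not value:
        raise ValueError("Missing Content-Type header")

    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = value.split(";", 1)[0].strip().lower()
    if media_type != "text/html":
        raise ValueError(f"Expected HTML content, got {media_type!r}")


# Passes silently: the parameter after ';' is ignored.
validate_html_content_type({"Content-Type": "text/html; charset=utf-8"})
```

Keeping the check separate from the request itself mirrors the documented split between `validate_content_type` and `fetch`, and makes the rule easy to test without network access.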

File diff suppressed because it is too large

@@ -0,0 +1,69 @@
{
"module": "omniread.pdf.client",
"content": {
"path": "omniread.pdf.client",
"docstring": "PDF client abstractions for OmniRead.\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems",
"objects": {
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.pdf.client.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"ABC": {
"name": "ABC",
"kind": "alias",
"path": "omniread.pdf.client.ABC",
"signature": "<bound method Alias.signature of Alias('ABC', 'abc.ABC')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.pdf.client.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"Path": {
"name": "Path",
"kind": "alias",
"path": "omniread.pdf.client.Path",
"signature": "<bound method Alias.signature of Alias('Path', 'pathlib.Path')>",
"docstring": null
},
"BasePDFClient": {
"name": "BasePDFClient",
"kind": "class",
"path": "omniread.pdf.client.BasePDFClient",
"signature": "<bound method Class.signature of Class('BasePDFClient', 22, 48)>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nImplementations must:\n- Accept a source identifier appropriate to the backing store\n- Return the full PDF binary payload\n- Raise retrieval-specific errors on failure",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.client.BasePDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 33, 48)>",
"docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source: Identifier of the PDF location, such as a file path,\n object storage key, or remote reference.\n\nReturns:\n Raw PDF bytes.\n\nRaises:\n Exception: Retrieval-specific errors defined by the implementation."
}
}
},
"FileSystemPDFClient": {
"name": "FileSystemPDFClient",
"kind": "class",
"path": "omniread.pdf.client.FileSystemPDFClient",
"signature": "<bound method Class.signature of Class('FileSystemPDFClient', 51, 80)>",
"docstring": "PDF client that reads from the local filesystem.\n\nThis client reads PDF files directly from the disk and returns their raw\nbinary contents.",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.client.FileSystemPDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 59, 80)>",
"docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path: Filesystem path to the PDF file.\n\nReturns:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError: If the path does not exist.\n ValueError: If the path exists but is not a file."
}
}
}
}
}
}
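
The `BasePDFClient` and `FileSystemPDFClient` docstrings above describe an abstract `fetch` plus a filesystem implementation that raises `FileNotFoundError` for a missing path and `ValueError` for a path that is not a file. The sketch below follows those documented rules; the parameter types, error messages, and the `Path(...)` coercion are assumptions, since the signatures are not shown in this diff.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class BasePDFClient(ABC):
    """Retrieves raw PDF bytes from some backing store."""

    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        """Return the full PDF binary payload for `source`."""


class FileSystemPDFClient(BasePDFClient):
    """Reads PDF bytes from the local filesystem."""

    def fetch(self, path: Any) -> bytes:
        path = Path(path)  # accept str or Path (assumed convenience)
        if not path.exists():
            raise FileNotFoundError(f"PDF not found: {path}")
        if not path.is_file():
            # For example, the path points at a directory.
            raise ValueError(f"Path is not a file: {path}")
        # No validation or parsing: the client only returns raw bytes.
        return path.read_bytes()
```

Because the client does no validation, any file's bytes are returned as-is; checking that the payload is really a PDF is left to later stages, consistent with the module docstring.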

@@ -0,0 +1,419 @@
{
"module": "omniread.pdf",
"content": {
"path": "omniread.pdf",
"docstring": "PDF format implementation for OmniRead.\n\nThis package provides **PDF-specific implementations** of the core OmniRead\ncontracts defined in `omniread.core`.\n\nUnlike HTML, PDF handling requires an explicit client layer for document\naccess. This package therefore includes:\n- PDF clients for acquiring raw PDF data\n- PDF scrapers that coordinate client access\n- PDF parsers that extract structured content from PDF binaries\n\nPublic exports from this package represent the supported PDF pipeline\nand are safe for consumers to import directly when working with PDFs.",
"objects": {
"FileSystemPDFClient": {
"name": "FileSystemPDFClient",
"kind": "class",
"path": "omniread.pdf.FileSystemPDFClient",
"signature": "<bound method Alias.signature of Alias('FileSystemPDFClient', 'omniread.pdf.client.FileSystemPDFClient')>",
"docstring": "PDF client that reads from the local filesystem.\n\nThis client reads PDF files directly from the disk and returns their raw\nbinary contents.",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.FileSystemPDFClient.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.pdf.client.FileSystemPDFClient.fetch')>",
"docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path: Filesystem path to the PDF file.\n\nReturns:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError: If the path does not exist.\n ValueError: If the path exists but is not a file."
}
}
},
"PDFScraper": {
"name": "PDFScraper",
"kind": "class",
"path": "omniread.pdf.PDFScraper",
"signature": "<bound method Alias.signature of Alias('PDFScraper', 'omniread.pdf.scraper.PDFScraper')>",
"docstring": "Scraper for PDF sources.\n\nDelegates byte retrieval to a PDF client and normalizes\noutput into Content.\n\nThe scraper:\n- Does not perform parsing or interpretation\n- Does not assume a specific storage backend\n- Preserves caller-provided metadata",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.PDFScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.pdf.scraper.PDFScraper.fetch')>",
"docstring": "Fetch a PDF document from the given source.\n\nArgs:\n source: Identifier of the PDF source as understood by the\n configured PDF client.\n metadata: Optional metadata to attach to the returned content.\n\nReturns:\n A `Content` instance containing:\n - Raw PDF bytes\n - Source identifier\n - PDF content type\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors raised by the PDF client."
}
}
},
"PDFParser": {
"name": "PDFParser",
"kind": "class",
"path": "omniread.pdf.PDFParser",
"signature": "<bound method Alias.signature of Alias('PDFParser', 'omniread.pdf.parser.PDFParser')>",
"docstring": "Base PDF parser.\n\nThis class enforces PDF content-type compatibility and provides the\nextension point for implementing concrete PDF parsing strategies.\n\nConcrete implementations must define:\n- Define the output type `T`\n- Implement the `parse()` method",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.pdf.PDFParser.supported_types",
"signature": "<bound method Alias.signature of Alias('supported_types', 'omniread.pdf.parser.PDFParser.supported_types')>",
"docstring": "Set of content types supported by this parser (PDF only)."
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.pdf.PDFParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.pdf.parser.PDFParser.parse')>",
"docstring": "Parse PDF content into a structured output.\n\nImplementations must fully interpret the PDF binary payload and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed representation of type `T`.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
}
}
},
"client": {
"name": "client",
"kind": "module",
"path": "omniread.pdf.client",
"signature": null,
"docstring": "PDF client abstractions for OmniRead.\n\nThis module defines the **client layer** responsible for retrieving raw PDF\nbytes from a concrete backing store.\n\nClients provide low-level access to PDF binaries and are intentionally\ndecoupled from scraping and parsing logic. They do not perform validation,\ninterpretation, or content extraction.\n\nTypical backing stores include:\n- Local filesystems\n- Object storage (S3, GCS, etc.)\n- Network file systems",
"members": {
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.pdf.client.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"ABC": {
"name": "ABC",
"kind": "alias",
"path": "omniread.pdf.client.ABC",
"signature": "<bound method Alias.signature of Alias('ABC', 'abc.ABC')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.pdf.client.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"Path": {
"name": "Path",
"kind": "alias",
"path": "omniread.pdf.client.Path",
"signature": "<bound method Alias.signature of Alias('Path', 'pathlib.Path')>",
"docstring": null
},
"BasePDFClient": {
"name": "BasePDFClient",
"kind": "class",
"path": "omniread.pdf.client.BasePDFClient",
"signature": "<bound method Class.signature of Class('BasePDFClient', 22, 48)>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nImplementations must:\n- Accept a source identifier appropriate to the backing store\n- Return the full PDF binary payload\n- Raise retrieval-specific errors on failure",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.client.BasePDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 33, 48)>",
"docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source: Identifier of the PDF location, such as a file path,\n object storage key, or remote reference.\n\nReturns:\n Raw PDF bytes.\n\nRaises:\n Exception: Retrieval-specific errors defined by the implementation."
}
}
},
"FileSystemPDFClient": {
"name": "FileSystemPDFClient",
"kind": "class",
"path": "omniread.pdf.client.FileSystemPDFClient",
"signature": "<bound method Class.signature of Class('FileSystemPDFClient', 51, 80)>",
"docstring": "PDF client that reads from the local filesystem.\n\nThis client reads PDF files directly from the disk and returns their raw\nbinary contents.",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.client.FileSystemPDFClient.fetch",
"signature": "<bound method Function.signature of Function('fetch', 59, 80)>",
"docstring": "Read a PDF file from the local filesystem.\n\nArgs:\n path: Filesystem path to the PDF file.\n\nReturns:\n Raw PDF bytes.\n\nRaises:\n FileNotFoundError: If the path does not exist.\n ValueError: If the path exists but is not a file."
}
}
}
}
},
"parser": {
"name": "parser",
"kind": "module",
"path": "omniread.pdf.parser",
"signature": null,
"docstring": "PDF parser base implementations for OmniRead.\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.",
"members": {
"Generic": {
"name": "Generic",
"kind": "alias",
"path": "omniread.pdf.parser.Generic",
"signature": "<bound method Alias.signature of Alias('Generic', 'typing.Generic')>",
"docstring": null
},
"TypeVar": {
"name": "TypeVar",
"kind": "alias",
"path": "omniread.pdf.parser.TypeVar",
"signature": "<bound method Alias.signature of Alias('TypeVar', 'typing.TypeVar')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.pdf.parser.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.pdf.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.pdf.parser.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.pdf.parser.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.pdf.parser.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.pdf.parser.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"BaseParser": {
"name": "BaseParser",
"kind": "class",
"path": "omniread.pdf.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nA parser is a self-contained object that owns the Content\nit is responsible for interpreting.\n\nImplementations must:\n- Declare supported content types via `supported_types`\n- Raise parsing-specific exceptions from `parse()`\n- Remain deterministic for a given input\n\nConsumers may rely on:\n- Early validation of content compatibility\n- Type-stable return values from `parse()`",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.pdf.parser.BaseParser.supported_types",
"signature": "<bound method Alias.signature of Alias('supported_types', 'omniread.core.parser.BaseParser.supported_types')>",
"docstring": "Set of content types supported by this parser.\n\nAn empty set indicates that the parser is content-type agnostic."
},
"content": {
"name": "content",
"kind": "attribute",
"path": "omniread.pdf.parser.BaseParser.content",
"signature": "<bound method Alias.signature of Alias('content', 'omniread.core.parser.BaseParser.content')>",
"docstring": null
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.pdf.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nImplementations must fully consume the provided content and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed, structured representation.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
},
"supports": {
"name": "supports",
"kind": "function",
"path": "omniread.pdf.parser.BaseParser.supports",
"signature": "<bound method Alias.signature of Alias('supports', 'omniread.core.parser.BaseParser.supports')>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n True if the content type is supported; False otherwise."
}
}
},
"T": {
"name": "T",
"kind": "attribute",
"path": "omniread.pdf.parser.T",
"signature": null,
"docstring": null
},
"PDFParser": {
"name": "PDFParser",
"kind": "class",
"path": "omniread.pdf.parser.PDFParser",
"signature": "<bound method Class.signature of Class('PDFParser', 20, 49)>",
"docstring": "Base PDF parser.\n\nThis class enforces PDF content-type compatibility and provides the\nextension point for implementing concrete PDF parsing strategies.\n\nConcrete implementations must define:\n- Define the output type `T`\n- Implement the `parse()` method",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.pdf.parser.PDFParser.supported_types",
"signature": null,
"docstring": "Set of content types supported by this parser (PDF only)."
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.pdf.parser.PDFParser.parse",
"signature": "<bound method Function.signature of Function('parse', 35, 49)>",
"docstring": "Parse PDF content into a structured output.\n\nImplementations must fully interpret the PDF binary payload and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed representation of type `T`.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
}
}
}
}
},
"scraper": {
"name": "scraper",
"kind": "module",
"path": "omniread.pdf.scraper",
"signature": null,
"docstring": "PDF scraping implementation for OmniRead.\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.",
"members": {
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.pdf.scraper.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Mapping": {
"name": "Mapping",
"kind": "alias",
"path": "omniread.pdf.scraper.Mapping",
"signature": "<bound method Alias.signature of Alias('Mapping', 'typing.Mapping')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.pdf.scraper.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.pdf.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.pdf.scraper.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.pdf.scraper.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.pdf.scraper.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.pdf.scraper.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.pdf.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.pdf.scraper.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.pdf.scraper.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.pdf.scraper.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.pdf.scraper.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"BaseScraper": {
"name": "BaseScraper",
"kind": "class",
"path": "omniread.pdf.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nA scraper is responsible ONLY for fetching raw content\n(bytes) from a source. It must not interpret or parse it.\n\nA scraper is a **stateless acquisition component** that retrieves raw\ncontent from a source and returns it as a `Content` object.\n\nScrapers define *how content is obtained*, not *what the content means*.\n\nImplementations may vary in:\n- Transport mechanism (HTTP, filesystem, cloud storage)\n- Authentication strategy\n- Retry and backoff behavior\n\nImplementations must not:\n- Parse content\n- Modify content semantics\n- Couple scraping logic to a specific parser",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nImplementations must retrieve the content referenced by `source`\nand return it as raw bytes wrapped in a `Content` object.\n\nArgs:\n source: Location identifier (URL, file path, S3 URI, etc.)\n metadata: Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content object containing raw bytes and metadata.\n - Raw content bytes\n - Source identifier\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors as defined by the implementation."
}
}
},
"BasePDFClient": {
"name": "BasePDFClient",
"kind": "class",
"path": "omniread.pdf.scraper.BasePDFClient",
"signature": "<bound method Alias.signature of Alias('BasePDFClient', 'omniread.pdf.client.BasePDFClient')>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nImplementations must:\n- Accept a source identifier appropriate to the backing store\n- Return the full PDF binary payload\n- Raise retrieval-specific errors on failure",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.scraper.BasePDFClient.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.pdf.client.BasePDFClient.fetch')>",
"docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source: Identifier of the PDF location, such as a file path,\n object storage key, or remote reference.\n\nReturns:\n Raw PDF bytes.\n\nRaises:\n Exception: Retrieval-specific errors defined by the implementation."
}
}
},
"PDFScraper": {
"name": "PDFScraper",
"kind": "class",
"path": "omniread.pdf.scraper.PDFScraper",
"signature": "<bound method Class.signature of Class('PDFScraper', 18, 71)>",
"docstring": "Scraper for PDF sources.\n\nDelegates byte retrieval to a PDF client and normalizes\noutput into Content.\n\nThe scraper:\n- Does not perform parsing or interpretation\n- Does not assume a specific storage backend\n- Preserves caller-provided metadata",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.scraper.PDFScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 40, 71)>",
"docstring": "Fetch a PDF document from the given source.\n\nArgs:\n source: Identifier of the PDF source as understood by the\n configured PDF client.\n metadata: Optional metadata to attach to the returned content.\n\nReturns:\n A `Content` instance containing:\n - Raw PDF bytes\n - Source identifier\n - PDF content type\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors raised by the PDF client."
}
}
}
}
}
}
}
}
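The `FileSystemPDFClient.fetch` docstring above documents three outcomes: raw bytes on success, `FileNotFoundError` for a missing path, and `ValueError` for a path that exists but is not a file. The following is a minimal sketch of that contract, not the library's actual source. Both classes are stand-ins written from the docstrings alone, and the `demo` helper and its file names are hypothetical.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class BasePDFClient(ABC):
    """Stand-in for the documented abstract client (not the real class)."""

    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        """Return the full PDF binary payload for `source`."""


class FileSystemPDFClient(BasePDFClient):
    """Stand-in that follows the documented filesystem error contract."""

    def fetch(self, path: Any) -> bytes:
        path = Path(path)
        if not path.exists():
            # Documented: FileNotFoundError if the path does not exist.
            raise FileNotFoundError(f"PDF not found: {path}")
        if not path.is_file():
            # Documented: ValueError if the path exists but is not a file.
            raise ValueError(f"Path is not a file: {path}")
        return path.read_bytes()


def demo() -> dict:
    """Exercise the three documented outcomes in a temporary directory."""
    client = FileSystemPDFClient()
    outcomes = {}
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = Path(tmp) / "sample.pdf"
        pdf_path.write_bytes(b"%PDF-1.7\n%%EOF\n")
        outcomes["bytes"] = client.fetch(pdf_path)
        try:
            client.fetch(Path(tmp) / "missing.pdf")
        except FileNotFoundError:
            outcomes["missing"] = "FileNotFoundError"
        try:
            client.fetch(tmp)  # the directory exists but is not a file
        except ValueError:
            outcomes["directory"] = "ValueError"
    return outcomes


if __name__ == "__main__":
    print(demo())
```

Note that the client performs no PDF validation, which matches the module docstring: any readable file is returned as bytes, and interpretation is left to a parser.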

@@ -0,0 +1,134 @@
{
"module": "omniread.pdf.parser",
"content": {
"path": "omniread.pdf.parser",
"docstring": "PDF parser base implementations for OmniRead.\n\nThis module defines the **PDF-specific parser contract**, extending the\nformat-agnostic `BaseParser` with constraints appropriate for PDF content.\n\nPDF parsers are responsible for interpreting binary PDF data and producing\nstructured representations suitable for downstream consumption.",
"objects": {
"Generic": {
"name": "Generic",
"kind": "alias",
"path": "omniread.pdf.parser.Generic",
"signature": "<bound method Alias.signature of Alias('Generic', 'typing.Generic')>",
"docstring": null
},
"TypeVar": {
"name": "TypeVar",
"kind": "alias",
"path": "omniread.pdf.parser.TypeVar",
"signature": "<bound method Alias.signature of Alias('TypeVar', 'typing.TypeVar')>",
"docstring": null
},
"abstractmethod": {
"name": "abstractmethod",
"kind": "alias",
"path": "omniread.pdf.parser.abstractmethod",
"signature": "<bound method Alias.signature of Alias('abstractmethod', 'abc.abstractmethod')>",
"docstring": null
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.pdf.parser.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.pdf.parser.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.pdf.parser.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.pdf.parser.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.pdf.parser.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"BaseParser": {
"name": "BaseParser",
"kind": "class",
"path": "omniread.pdf.parser.BaseParser",
"signature": "<bound method Alias.signature of Alias('BaseParser', 'omniread.core.parser.BaseParser')>",
"docstring": "Base interface for all parsers.\n\nA parser is a self-contained object that owns the Content\nit is responsible for interpreting.\n\nImplementations must:\n- Declare supported content types via `supported_types`\n- Raise parsing-specific exceptions from `parse()`\n- Remain deterministic for a given input\n\nConsumers may rely on:\n- Early validation of content compatibility\n- Type-stable return values from `parse()`",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.pdf.parser.BaseParser.supported_types",
"signature": "<bound method Alias.signature of Alias('supported_types', 'omniread.core.parser.BaseParser.supported_types')>",
"docstring": "Set of content types supported by this parser.\n\nAn empty set indicates that the parser is content-type agnostic."
},
"content": {
"name": "content",
"kind": "attribute",
"path": "omniread.pdf.parser.BaseParser.content",
"signature": "<bound method Alias.signature of Alias('content', 'omniread.core.parser.BaseParser.content')>",
"docstring": null
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.pdf.parser.BaseParser.parse",
"signature": "<bound method Alias.signature of Alias('parse', 'omniread.core.parser.BaseParser.parse')>",
"docstring": "Parse the owned content into structured output.\n\nImplementations must fully consume the provided content and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed, structured representation.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
},
"supports": {
"name": "supports",
"kind": "function",
"path": "omniread.pdf.parser.BaseParser.supports",
"signature": "<bound method Alias.signature of Alias('supports', 'omniread.core.parser.BaseParser.supports')>",
"docstring": "Check whether this parser supports the content's type.\n\nReturns:\n True if the content type is supported; False otherwise."
}
}
},
"T": {
"name": "T",
"kind": "attribute",
"path": "omniread.pdf.parser.T",
"signature": null,
"docstring": null
},
"PDFParser": {
"name": "PDFParser",
"kind": "class",
"path": "omniread.pdf.parser.PDFParser",
"signature": "<bound method Class.signature of Class('PDFParser', 20, 49)>",
"docstring": "Base PDF parser.\n\nThis class enforces PDF content-type compatibility and provides the\nextension point for implementing concrete PDF parsing strategies.\n\nConcrete implementations must define:\n- Define the output type `T`\n- Implement the `parse()` method",
"members": {
"supported_types": {
"name": "supported_types",
"kind": "attribute",
"path": "omniread.pdf.parser.PDFParser.supported_types",
"signature": null,
"docstring": "Set of content types supported by this parser (PDF only)."
},
"parse": {
"name": "parse",
"kind": "function",
"path": "omniread.pdf.parser.PDFParser.parse",
"signature": "<bound method Function.signature of Function('parse', 35, 49)>",
"docstring": "Parse PDF content into a structured output.\n\nImplementations must fully interpret the PDF binary payload and\nreturn a deterministic, structured output.\n\nReturns:\n Parsed representation of type `T`.\n\nRaises:\n Exception: Parsing-specific errors as defined by the implementation."
}
}
}
}
}
}
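The parser docstrings above describe a generic base (`BaseParser`) that owns its `Content`, a PDF-only `supported_types` set on `PDFParser`, and two obligations for subclasses: pick the output type `T` and implement `parse()`. A rough sketch of how those pieces could fit together follows. Every class here is a stand-in reconstructed from the docstrings. The constructor shape, the `ValueError` raised on an unsupported type, the MIME strings, and the `PDFVersionParser` example are all assumptions, not the library's code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Generic, Mapping, Optional, Set, TypeVar

T = TypeVar("T")


class ContentType(str, Enum):
    """Stand-in enum; the MIME strings are assumptions."""
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass
class Content:
    """Stand-in with the four documented attributes."""
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseParser(ABC, Generic[T]):
    # Documented: an empty set means the parser is content-type agnostic.
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        self.content = content
        # Assumed form of the documented "early validation".
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        if not self.supported_types:
            return True
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        """Return a deterministic, structured result."""


class PDFParser(BaseParser[T]):
    supported_types = {ContentType.PDF}


class PDFVersionParser(PDFParser[str]):
    """Hypothetical concrete parser: T is str, parse() reads the header version."""

    def parse(self) -> str:
        header = self.content.raw[:8].decode("ascii", errors="replace")
        if not header.startswith("%PDF-"):
            raise ValueError("Missing %PDF- header")
        return header[5:8]


if __name__ == "__main__":
    pdf = Content(raw=b"%PDF-1.7\n%%EOF\n", source="sample.pdf",
                  content_type=ContentType.PDF)
    print(PDFVersionParser(pdf).parse())
```

Binding `T` in the subclass declaration (`PDFParser[str]`) lets a type checker infer the return type of `parse()` at call sites, which is one way to get the "type-stable return values" the `BaseParser` docstring promises.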

@@ -0,0 +1,152 @@
{
"module": "omniread.pdf.scraper",
"content": {
"path": "omniread.pdf.scraper",
"docstring": "PDF scraping implementation for OmniRead.\n\nThis module provides a PDF-specific scraper that coordinates PDF byte\nretrieval via a client and normalizes the result into a `Content` object.\n\nThe scraper implements the core `BaseScraper` contract while delegating\nall storage and access concerns to a `BasePDFClient` implementation.",
"objects": {
"Any": {
"name": "Any",
"kind": "alias",
"path": "omniread.pdf.scraper.Any",
"signature": "<bound method Alias.signature of Alias('Any', 'typing.Any')>",
"docstring": null
},
"Mapping": {
"name": "Mapping",
"kind": "alias",
"path": "omniread.pdf.scraper.Mapping",
"signature": "<bound method Alias.signature of Alias('Mapping', 'typing.Mapping')>",
"docstring": null
},
"Optional": {
"name": "Optional",
"kind": "alias",
"path": "omniread.pdf.scraper.Optional",
"signature": "<bound method Alias.signature of Alias('Optional', 'typing.Optional')>",
"docstring": null
},
"Content": {
"name": "Content",
"kind": "class",
"path": "omniread.pdf.scraper.Content",
"signature": "<bound method Alias.signature of Alias('Content', 'omniread.core.content.Content')>",
"docstring": "Normalized representation of extracted content.\n\nA `Content` instance represents a raw content payload along with minimal\ncontextual metadata describing its origin and type.\n\nThis class is the **primary exchange format** between:\n- Scrapers\n- Parsers\n- Downstream consumers\n\nAttributes:\n raw: Raw content bytes as retrieved from the source.\n source: Identifier of the content origin (URL, file path, or logical name).\n content_type: Optional MIME type of the content, if known.\n metadata: Optional, implementation-defined metadata associated with\n the content (e.g., headers, encoding hints, extraction notes).",
"members": {
"raw": {
"name": "raw",
"kind": "attribute",
"path": "omniread.pdf.scraper.Content.raw",
"signature": "<bound method Alias.signature of Alias('raw', 'omniread.core.content.Content.raw')>",
"docstring": null
},
"source": {
"name": "source",
"kind": "attribute",
"path": "omniread.pdf.scraper.Content.source",
"signature": "<bound method Alias.signature of Alias('source', 'omniread.core.content.Content.source')>",
"docstring": null
},
"content_type": {
"name": "content_type",
"kind": "attribute",
"path": "omniread.pdf.scraper.Content.content_type",
"signature": "<bound method Alias.signature of Alias('content_type', 'omniread.core.content.Content.content_type')>",
"docstring": null
},
"metadata": {
"name": "metadata",
"kind": "attribute",
"path": "omniread.pdf.scraper.Content.metadata",
"signature": "<bound method Alias.signature of Alias('metadata', 'omniread.core.content.Content.metadata')>",
"docstring": null
}
}
},
"ContentType": {
"name": "ContentType",
"kind": "class",
"path": "omniread.pdf.scraper.ContentType",
"signature": "<bound method Alias.signature of Alias('ContentType', 'omniread.core.content.ContentType')>",
"docstring": "Supported MIME types for extracted content.\n\nThis enum represents the declared or inferred media type of the content\nsource. It is primarily used for routing content to the appropriate\nparser or downstream consumer.",
"members": {
"HTML": {
"name": "HTML",
"kind": "attribute",
"path": "omniread.pdf.scraper.ContentType.HTML",
"signature": "<bound method Alias.signature of Alias('HTML', 'omniread.core.content.ContentType.HTML')>",
"docstring": "HTML document content."
},
"PDF": {
"name": "PDF",
"kind": "attribute",
"path": "omniread.pdf.scraper.ContentType.PDF",
"signature": "<bound method Alias.signature of Alias('PDF', 'omniread.core.content.ContentType.PDF')>",
"docstring": "PDF document content."
},
"JSON": {
"name": "JSON",
"kind": "attribute",
"path": "omniread.pdf.scraper.ContentType.JSON",
"signature": "<bound method Alias.signature of Alias('JSON', 'omniread.core.content.ContentType.JSON')>",
"docstring": "JSON document content."
},
"XML": {
"name": "XML",
"kind": "attribute",
"path": "omniread.pdf.scraper.ContentType.XML",
"signature": "<bound method Alias.signature of Alias('XML', 'omniread.core.content.ContentType.XML')>",
"docstring": "XML document content."
}
}
},
"BaseScraper": {
"name": "BaseScraper",
"kind": "class",
"path": "omniread.pdf.scraper.BaseScraper",
"signature": "<bound method Alias.signature of Alias('BaseScraper', 'omniread.core.scraper.BaseScraper')>",
"docstring": "Base interface for all scrapers.\n\nA scraper is responsible ONLY for fetching raw content\n(bytes) from a source. It must not interpret or parse it.\n\nA scraper is a **stateless acquisition component** that retrieves raw\ncontent from a source and returns it as a `Content` object.\n\nScrapers define *how content is obtained*, not *what the content means*.\n\nImplementations may vary in:\n- Transport mechanism (HTTP, filesystem, cloud storage)\n- Authentication strategy\n- Retry and backoff behavior\n\nImplementations must not:\n- Parse content\n- Modify content semantics\n- Couple scraping logic to a specific parser",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.scraper.BaseScraper.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.core.scraper.BaseScraper.fetch')>",
"docstring": "Fetch raw content from the given source.\n\nImplementations must retrieve the content referenced by `source`\nand return it as raw bytes wrapped in a `Content` object.\n\nArgs:\n source: Location identifier (URL, file path, S3 URI, etc.)\n metadata: Optional hints for the scraper (headers, auth, etc.)\n\nReturns:\n Content object containing raw bytes and metadata.\n - Raw content bytes\n - Source identifier\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors as defined by the implementation."
}
}
},
"BasePDFClient": {
"name": "BasePDFClient",
"kind": "class",
"path": "omniread.pdf.scraper.BasePDFClient",
"signature": "<bound method Alias.signature of Alias('BasePDFClient', 'omniread.pdf.client.BasePDFClient')>",
"docstring": "Abstract client responsible for retrieving PDF bytes\nfrom a specific backing store (filesystem, S3, FTP, etc.).\n\nImplementations must:\n- Accept a source identifier appropriate to the backing store\n- Return the full PDF binary payload\n- Raise retrieval-specific errors on failure",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.scraper.BasePDFClient.fetch",
"signature": "<bound method Alias.signature of Alias('fetch', 'omniread.pdf.client.BasePDFClient.fetch')>",
"docstring": "Fetch raw PDF bytes from the given source.\n\nArgs:\n source: Identifier of the PDF location, such as a file path,\n object storage key, or remote reference.\n\nReturns:\n Raw PDF bytes.\n\nRaises:\n Exception: Retrieval-specific errors defined by the implementation."
}
}
},
"PDFScraper": {
"name": "PDFScraper",
"kind": "class",
"path": "omniread.pdf.scraper.PDFScraper",
"signature": "<bound method Class.signature of Class('PDFScraper', 18, 71)>",
"docstring": "Scraper for PDF sources.\n\nDelegates byte retrieval to a PDF client and normalizes\noutput into Content.\n\nThe scraper:\n- Does not perform parsing or interpretation\n- Does not assume a specific storage backend\n- Preserves caller-provided metadata",
"members": {
"fetch": {
"name": "fetch",
"kind": "function",
"path": "omniread.pdf.scraper.PDFScraper.fetch",
"signature": "<bound method Function.signature of Function('fetch', 40, 71)>",
"docstring": "Fetch a PDF document from the given source.\n\nArgs:\n source: Identifier of the PDF source as understood by the\n configured PDF client.\n metadata: Optional metadata to attach to the returned content.\n\nReturns:\n A `Content` instance containing:\n - Raw PDF bytes\n - Source identifier\n - PDF content type\n - Optional metadata\n\nRaises:\n Exception: Retrieval-specific errors raised by the PDF client."
}
}
}
}
}
}

50
mcp_docs/nav.json Normal file

@@ -0,0 +1,50 @@
[
{
"module": "omniread",
"resource": "doc://modules/omniread"
},
{
"module": "omniread.core",
"resource": "doc://modules/omniread.core"
},
{
"module": "omniread.core.content",
"resource": "doc://modules/omniread.core.content"
},
{
"module": "omniread.core.parser",
"resource": "doc://modules/omniread.core.parser"
},
{
"module": "omniread.core.scraper",
"resource": "doc://modules/omniread.core.scraper"
},
{
"module": "omniread.html",
"resource": "doc://modules/omniread.html"
},
{
"module": "omniread.html.parser",
"resource": "doc://modules/omniread.html.parser"
},
{
"module": "omniread.html.scraper",
"resource": "doc://modules/omniread.html.scraper"
},
{
"module": "omniread.pdf",
"resource": "doc://modules/omniread.pdf"
},
{
"module": "omniread.pdf.client",
"resource": "doc://modules/omniread.pdf.client"
},
{
"module": "omniread.pdf.parser",
"resource": "doc://modules/omniread.pdf.parser"
},
{
"module": "omniread.pdf.scraper",
"resource": "doc://modules/omniread.pdf.scraper"
}
]
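
Each entry in `nav.json` pairs a dotted module path with a resource URI of the form `doc://modules/<module>`. As a minimal sketch of how a consumer might index these entries, assuming that convention holds for every entry (the `build_nav_index` helper and the inline excerpt are illustrative and not part of the repository):

```python
import json
from typing import Dict

# Inline excerpt of the entries above; a real consumer would read the file.
NAV_JSON = """
[
  {"module": "omniread", "resource": "doc://modules/omniread"},
  {"module": "omniread.core", "resource": "doc://modules/omniread.core"},
  {"module": "omniread.pdf.scraper", "resource": "doc://modules/omniread.pdf.scraper"}
]
"""

RESOURCE_PREFIX = "doc://modules/"


def build_nav_index(nav_text: str) -> Dict[str, str]:
    """Map each module path to its resource URI, checking the URI convention."""
    index: Dict[str, str] = {}
    for entry in json.loads(nav_text):
        module, resource = entry["module"], entry["resource"]
        if resource != RESOURCE_PREFIX + module:
            raise ValueError(f"Unexpected resource for {module!r}: {resource!r}")
        index[module] = resource
    return index


nav_index = build_nav_index(NAV_JSON)
print(nav_index["omniread.core"])
```

Failing loudly on a mismatched URI is a design choice for this sketch; a more lenient consumer could log the entry and continue.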

54
mkdocs.yml Normal file

@@ -0,0 +1,54 @@
site_name: Aetoskia OmniRead
site_description: Format-agnostic document reading, parsing, and scraping framework
theme:
name: material
palette:
- scheme: slate
primary: deep purple
accent: cyan
font:
text: Inter
code: JetBrains Mono
features:
- navigation.tabs
- navigation.expand
- navigation.top
- navigation.instant
- content.code.copy
- content.code.annotate
plugins:
- search
- mkdocstrings:
handlers:
python:
paths:
- .
options:
docstring_style: google
show_source: false
show_signature_annotations: true
separate_signature: true
merge_init_into_class: true
inherited_members: true
annotations_path: brief
show_root_heading: true
group_by_category: true
nav:
- Home: omniread/index.md
- Core API:
- omniread/core/index.md
- omniread/core/content.md
- omniread/core/parser.md
- omniread/core/scraper.md
- HTML Handling:
- omniread/html/index.md
- omniread/html/parser.md
- omniread/html/scraper.md
- PDF Handling:
- omniread/pdf/index.md
- omniread/pdf/client.md
- omniread/pdf/parser.md
- omniread/pdf/scraper.md


@@ -0,0 +1,137 @@
"""
OmniRead — format-agnostic content acquisition and parsing framework.
OmniRead provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.
The library is structured around three core concepts:
1. **Content**
A canonical, format-agnostic container representing raw content bytes
and minimal contextual metadata.
2. **Scrapers**
Components responsible for *acquiring* raw content from a source
(HTTP, filesystem, object storage, etc.). Scrapers never interpret
content.
3. **Parsers**
Components responsible for *interpreting* acquired content and
converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior
----------------------------------------------------------------------
Installation
----------------------------------------------------------------------
Install OmniRead using pip:
pip install omniread
Or with Poetry:
poetry add omniread
----------------------------------------------------------------------
Basic Usage
----------------------------------------------------------------------
HTML example:
from omniread import HTMLScraper, HTMLParser
scraper = HTMLScraper()
content = scraper.fetch("https://example.com")
class TitleParser(HTMLParser[str]):
def parse(self) -> str:
return self._soup.title.string
parser = TitleParser(content)
title = parser.parse()
PDF example:
from omniread import FileSystemPDFClient, PDFScraper, PDFParser
from pathlib import Path
client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))
class TextPDFParser(PDFParser[str]):
def parse(self) -> str:
# implement PDF text extraction
...
parser = TextPDFParser(content)
result = parser.parse()
----------------------------------------------------------------------
Public API Surface
----------------------------------------------------------------------
This module re-exports the **recommended public entry points** of OmniRead.
Consumers are encouraged to import from this namespace rather than from
format-specific submodules directly, unless advanced customization is
required.
Core:
- Content
- ContentType
HTML:
- HTMLScraper
- HTMLParser
PDF:
- FileSystemPDFClient
- PDFScraper
- PDFParser
## Core Philosophy
`OmniRead` is designed as a **decoupled content engine**:
1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model, ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input is HTML, PDF, or JSON.
## Documentation Design
For those extending `OmniRead`, follow these "AI-Native" docstring principles:
### For Humans
- **Clear Contracts**: Explicitly state what a component is and is NOT responsible for.
- **Runnable Examples**: Include small, logical snippets in the package `__init__.py`.
### For LLMs
- **Structured Models**: Use dataclasses and enums for core data to ensure clean MCP JSON representation.
- **Type Safety**: All public APIs must be fully typed and have corresponding `.pyi` stubs.
- **Detailed Raises**: Include `ExceptionType: description` pairs in the `Raises` section to help agents handle errors gracefully.
"""
from .core import Content, ContentType
from .html import HTMLScraper, HTMLParser
from .pdf import FileSystemPDFClient, PDFScraper, PDFParser
__all__ = [
# core
"Content",
"ContentType",
# html
"HTMLScraper",
"HTMLParser",
# pdf
"FileSystemPDFClient",
"PDFScraper",
"PDFParser",
]

13
omniread/__init__.pyi Normal file

@@ -0,0 +1,13 @@
from .core import Content, ContentType
from .html import HTMLScraper, HTMLParser
from .pdf import FileSystemPDFClient, PDFScraper, PDFParser
__all__ = [
"Content",
"ContentType",
"HTMLScraper",
"HTMLParser",
"FileSystemPDFClient",
"PDFScraper",
"PDFParser",
]


@@ -0,0 +1,24 @@
"""
Core domain contracts for OmniRead.
This package defines the **format-agnostic domain layer** of OmniRead.
It exposes canonical content models and abstract interfaces that are
implemented by format-specific modules (HTML, PDF, etc.).
Public exports from this package are considered **stable contracts** and
are safe for downstream consumers to depend on.
Submodules:
- content: Canonical content models and enums
- parser: Abstract parsing contracts
- scraper: Abstract scraping contracts
Format-specific behavior must not be introduced at this layer.
"""
from .content import Content, ContentType
__all__ = [
"Content",
"ContentType",
]


@@ -0,0 +1,10 @@
from .content import Content, ContentType
from .parser import BaseParser
from .scraper import BaseScraper
__all__ = [
"Content",
"ContentType",
"BaseParser",
"BaseScraper",
]


@@ -1,17 +1,62 @@
"""
Canonical content models for OmniRead.
This module defines the **format-agnostic content representation** used across
all parsers and scrapers in OmniRead.
The models defined here represent *what* was extracted, not *how* it was
retrieved or parsed. Format-specific behavior and metadata must not alter
the semantic meaning of these models.
"""
from enum import Enum
from dataclasses import dataclass
from typing import Any, Mapping, Optional
class ContentType(str, Enum):
"""
Supported MIME types for extracted content.
This enum represents the declared or inferred media type of the content
source. It is primarily used for routing content to the appropriate
parser or downstream consumer.
"""
HTML = "text/html"
"""HTML document content."""
PDF = "application/pdf"
"""PDF document content."""
JSON = "application/json"
"""JSON document content."""
XML = "application/xml"
"""XML document content."""
@dataclass(slots=True)
class Content:
"""
Normalized representation of extracted content.
A `Content` instance represents a raw content payload along with minimal
contextual metadata describing its origin and type.
This class is the **primary exchange format** between:
- Scrapers
- Parsers
- Downstream consumers
Attributes:
raw: Raw content bytes as retrieved from the source.
source: Identifier of the content origin (URL, file path, or logical name).
content_type: Optional MIME type of the content, if known.
metadata: Optional, implementation-defined metadata associated with
the content (e.g., headers, encoding hints, extraction notes).
"""
raw: bytes
source: str
content_type: Optional[ContentType] = None
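
Because `ContentType` mixes in `str`, its members compare equal to their MIME strings and can be looked up by value. The sketch below uses trimmed stand-ins for the two models to show that behaviour and a simple routing decision; the `describe` helper is hypothetical, and `slots=True` is omitted only so the snippet also runs on Python versions older than 3.10.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


def describe(content: Content) -> str:
    """Route on the declared content type (hypothetical helper)."""
    if content.content_type is ContentType.HTML:
        return "route to an HTML parser"
    if content.content_type is ContentType.PDF:
        return "route to a PDF parser"
    return "no declared type; caller must decide"


page = Content(
    raw=b"<html></html>",
    source="https://example.com",
    content_type=ContentType.HTML,
)

# Members compare equal to their MIME strings, and a MIME string can be
# converted back into a member by value lookup.
html_equals_mime = ContentType.HTML == "text/html"
pdf_member = ContentType("application/pdf")
```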

15
omniread/core/content.pyi Normal file

@@ -0,0 +1,15 @@
from enum import Enum
from typing import Any, Mapping, Optional
class ContentType(str, Enum):
HTML = "text/html"
PDF = "application/pdf"
JSON = "application/json"
XML = "application/xml"
class Content:
raw: bytes
source: str
content_type: Optional[ContentType]
metadata: Optional[Mapping[str, Any]]
def __init__(self, raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...) -> None: ...


@@ -1,3 +1,20 @@
"""
Abstract parsing contracts for OmniRead.
This module defines the **format-agnostic parser interface** used to transform
raw content into structured, typed representations.
Parsers are responsible for:
- Interpreting a single `Content` instance
- Validating compatibility with the content type
- Producing a structured output suitable for downstream consumers
Parsers are not responsible for:
- Fetching or acquiring content
- Performing retries or error recovery
- Managing multiple content sources
"""
from abc import ABC, abstractmethod
from typing import Generic, TypeVar, Set
@@ -12,11 +29,34 @@ class BaseParser(ABC, Generic[T]):
A parser is a self-contained object that owns the Content
it is responsible for interpreting.
Implementations must:
- Declare supported content types via `supported_types`
- Raise parsing-specific exceptions from `parse()`
- Remain deterministic for a given input
Consumers may rely on:
- Early validation of content compatibility
- Type-stable return values from `parse()`
"""
supported_types: Set[ContentType] = set()
"""Set of content types supported by this parser.
An empty set indicates that the parser is content-type agnostic.
"""
def __init__(self, content: Content):
"""
Initialize the parser with content to be parsed.
Args:
content: Content instance to be parsed.
Raises:
ValueError: If the content type is not supported by this parser.
"""
self.content = content
if not self.supports():
@@ -30,15 +70,25 @@ class BaseParser(ABC, Generic[T]):
"""
Parse the owned content into structured output.
Implementations must fully consume the provided content and
return a deterministic, structured output.
Returns:
Parsed, structured representation.
Raises:
Exception: Parsing-specific errors as defined by the implementation.
"""
raise NotImplementedError
def supports(self) -> bool:
"""
Check whether this parser supports the content's type.
Returns:
True if the content type is supported; False otherwise.
"""
if not self.supported_types:
return True
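
The hunk above ends before the remainder of `supports()` is shown. The following self-contained sketch illustrates the early-validation contract under the assumption that the remaining check is a simple membership test against `supported_types`. The stand-in classes, the error message, and `JSONKeysParser` are illustrative only.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, List, Optional, Set, TypeVar

T = TypeVar("T")


class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None


class BaseParser(ABC, Generic[T]):
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content):
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    @abstractmethod
    def parse(self) -> T:
        raise NotImplementedError

    def supports(self) -> bool:
        if not self.supported_types:
            return True  # an empty set means "content-type agnostic"
        # Assumed fall-through: membership test against the declared types.
        return self.content.content_type in self.supported_types


class JSONKeysParser(BaseParser[List[str]]):
    """Hypothetical parser returning the sorted top-level keys of a JSON object."""

    supported_types = {ContentType.JSON}

    def parse(self) -> List[str]:
        return sorted(json.loads(self.content.raw))


keys = JSONKeysParser(
    Content(raw=b'{"b": 1, "a": 2}', source="inline", content_type=ContentType.JSON)
).parse()

try:
    JSONKeysParser(Content(raw=b"<p>hi</p>", source="inline", content_type=ContentType.HTML))
    rejected = False
except ValueError:
    rejected = True
```

Validating in the constructor means an incompatible parser fails at construction time rather than part-way through `parse()`.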

13
omniread/core/parser.pyi Normal file

@@ -0,0 +1,13 @@
from abc import ABC, abstractmethod
from typing import Generic, TypeVar, Set
from .content import Content, ContentType
T = TypeVar("T")
class BaseParser(ABC, Generic[T]):
supported_types: Set[ContentType]
content: Content
def __init__(self, content: Content) -> None: ...
@abstractmethod
def parse(self) -> T: ...
def supports(self) -> bool: ...


@@ -1,3 +1,22 @@
"""
Abstract scraping contracts for OmniRead.
This module defines the **format-agnostic scraper interface** responsible for
acquiring raw content from external sources.
Scrapers are responsible for:
- Locating and retrieving raw content bytes
- Attaching minimal contextual metadata
- Returning normalized `Content` objects
Scrapers are explicitly NOT responsible for:
- Parsing or interpreting content
- Inferring structure or semantics
- Performing content-type specific processing
All interpretation must be delegated to parsers.
"""
from abc import ABC, abstractmethod
from typing import Any, Mapping, Optional
@@ -10,6 +29,21 @@ class BaseScraper(ABC):
A scraper is responsible ONLY for fetching raw content
(bytes) from a source. It must not interpret or parse it.
A scraper is a **stateless acquisition component** that retrieves raw
content from a source and returns it as a `Content` object.
Scrapers define *how content is obtained*, not *what the content means*.
Implementations may vary in:
- Transport mechanism (HTTP, filesystem, cloud storage)
- Authentication strategy
- Retry and backoff behavior
Implementations must not:
- Parse content
- Modify content semantics
- Couple scraping logic to a specific parser
"""
@abstractmethod
@@ -22,11 +56,20 @@ class BaseScraper(ABC):
"""
Fetch raw content from the given source.
Implementations must retrieve the content referenced by `source`
and return it as raw bytes wrapped in a `Content` object.
Args:
source: Location identifier (URL, file path, S3 URI, etc.)
metadata: Optional hints for the scraper (headers, auth, etc.)
Returns:
Content object containing:
- Raw content bytes
- Source identifier
- Optional metadata
Raises:
Exception: Retrieval-specific errors as defined by the implementation.
"""
raise NotImplementedError
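
To make the scraper contract concrete, here is a minimal sketch of a custom implementation backed by an in-memory dict. `InMemoryScraper` is a hypothetical class, and `Content` and `BaseScraper` are trimmed stand-ins so the snippet runs without OmniRead installed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


@dataclass
class Content:
    raw: bytes
    source: str
    metadata: Optional[Mapping[str, Any]] = None


class BaseScraper(ABC):
    @abstractmethod
    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        raise NotImplementedError


class InMemoryScraper(BaseScraper):
    """Hypothetical scraper backed by a dict; handy as a test double."""

    def __init__(self, store: Dict[str, bytes]):
        self._store = store

    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        try:
            raw = self._store[source]
        except KeyError:
            raise LookupError(f"No content stored for {source!r}") from None
        # Bytes are returned untouched: no decoding, no parsing.
        return Content(raw=raw, source=source, metadata=dict(metadata) if metadata else None)


scraper = InMemoryScraper({"greeting": b"hello"})
content = scraper.fetch("greeting", metadata={"note": "fixture"})
```

The scraper only moves bytes and attaches metadata; anything that inspects `content.raw` belongs in a parser.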


@@ -0,0 +1,7 @@
from abc import ABC, abstractmethod
from typing import Any, Mapping, Optional
from .content import Content
class BaseScraper(ABC):
@abstractmethod
def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = ...) -> Content: ...


@@ -0,0 +1,27 @@
"""
HTML format implementation for OmniRead.
This package provides **HTML-specific implementations** of the core OmniRead
contracts defined in `omniread.core`.
It includes:
- HTML parsers that interpret HTML content
- HTML scrapers that retrieve HTML documents
This package:
- Implements, but does not redefine, core contracts
- May contain HTML-specific behavior and edge-case handling
- Produces canonical content models defined in `omniread.core.content`
Consumers should depend on `omniread.core` interfaces wherever possible and
use this package only when HTML-specific behavior is required.
"""
from .scraper import HTMLScraper
from .parser import HTMLParser
__all__ = [
"HTMLScraper",
"HTMLParser",
]


@@ -0,0 +1,4 @@
from .scraper import HTMLScraper
from .parser import HTMLParser
__all__ = ["HTMLScraper", "HTMLParser"]


@@ -1,6 +1,21 @@
"""
HTML parser base implementations for OmniRead.
This module provides reusable HTML parsing utilities built on top of
the abstract parser contracts defined in `omniread.core.parser`.
It supplies:
- Content-type enforcement for HTML inputs
- BeautifulSoup initialization and lifecycle management
- Common helper methods for extracting structured data from HTML elements
Concrete parsers must subclass `HTMLParser` and implement the `parse()` method
to return a structured representation appropriate for their use case.
"""
from typing import Any, Generic, TypeVar, Optional
from abc import abstractmethod
from bs4 import BeautifulSoup, Tag
from omniread.core.content import ContentType, Content
@@ -13,13 +28,37 @@ class HTMLParser(BaseParser[T], Generic[T]):
"""
Base HTML parser.
This class extends the core `BaseParser` with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
Provides reusable helpers for HTML extraction.
Concrete parsers must explicitly define the return type.
Characteristics:
- Accepts only HTML content
- Owns a parsed BeautifulSoup DOM tree
- Provides pure helper utilities for common HTML structures
Concrete subclasses must:
- Define the output type `T`
- Implement the `parse()` method
"""
supported_types = {ContentType.HTML}
"""Set of content types supported by this parser (HTML only)."""
def __init__(self, content: Content, features: str = "html.parser"):
"""
Initialize the HTML parser.
Args:
content: HTML content to be parsed.
features: BeautifulSoup parser backend to use
(e.g., 'html.parser', 'lxml').
Raises:
ValueError: If the content type is not HTML or the content payload is empty.
"""
super().__init__(content)
self._features = features
self._soup = self._get_soup()
@@ -32,6 +71,12 @@ class HTMLParser(BaseParser[T], Generic[T]):
def parse(self) -> T:
"""
Fully parse the HTML content into structured output.
Implementations must fully interpret the HTML DOM and return
a deterministic, structured output.
Returns:
Parsed representation of type `T`.
"""
raise NotImplementedError
@@ -41,14 +86,42 @@ class HTMLParser(BaseParser[T], Generic[T]):
@staticmethod
def parse_div(div: Tag, *, separator: str = " ") -> str:
"""
Extract normalized text from a `<div>` element.
Args:
div: BeautifulSoup tag representing a `<div>`.
separator: String used to separate text nodes.
Returns:
Flattened, whitespace-normalized text content.
"""
return div.get_text(separator=separator, strip=True)
@staticmethod
def parse_link(a: Tag) -> Optional[str]:
"""
Extract the hyperlink reference from an `<a>` element.
Args:
a: BeautifulSoup tag representing an anchor.
Returns:
The value of the `href` attribute, or None if absent.
"""
return a.get("href")
@staticmethod
def parse_table(table: Tag) -> list[list[str]]:
"""
Parse an HTML table into a 2D list of strings.
Args:
table: BeautifulSoup tag representing a `<table>`.
Returns:
A list of rows, where each row is a list of cell text values.
"""
rows: list[list[str]] = []
for tr in table.find_all("tr"):
cells = [
@@ -64,11 +137,30 @@ class HTMLParser(BaseParser[T], Generic[T]):
# ----------------------------
def _get_soup(self) -> BeautifulSoup:
"""
Build a BeautifulSoup DOM tree from raw HTML content.
Returns:
Parsed BeautifulSoup document tree.
Raises:
ValueError: If the content payload is empty.
"""
if not self.content.raw:
raise ValueError("Empty HTML content")
return BeautifulSoup(self.content.raw, features=self._features)
def parse_meta(self) -> dict[str, Any]:
"""
Extract high-level metadata from the HTML document.
This includes:
- Document title
- `<meta>` tag name/property → content mappings
Returns:
Dictionary containing extracted metadata.
"""
soup = self._soup
title = soup.title.string.strip() if soup.title and soup.title.string else None
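
`parse_table` returns a list of rows, each a list of cell strings. The body of the helper is cut off in the hunk above, so the following is only a sketch of that output shape. It swaps BeautifulSoup for the standard library's `html.parser` so it can run without third-party packages. Including `<th>` cells and skipping rows without cells are assumptions, and `TableRowCollector` and `table_to_rows` are hypothetical names.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import List, Optional


class TableRowCollector(StdlibHTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # assumption: rows without cells are skipped
                self.rows.append(self._row)
            self._row = None


def table_to_rows(html: str) -> List[List[str]]:
    collector = TableRowCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


SAMPLE = """
<table>
  <thead><tr><th>Name</th><th>Age</th><th>City</th></tr></thead>
  <tbody>
    <tr><td>Alice</td><td>30</td><td>London</td></tr>
    <tr><td>Bob</td><td>25</td><td>New York</td></tr>
  </tbody>
</table>
"""

rows = table_to_rows(SAMPLE)
```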

18
omniread/html/parser.pyi Normal file

@@ -0,0 +1,18 @@
from typing import Any, Generic, TypeVar, Optional
from bs4 import BeautifulSoup, Tag
from omniread.core.content import ContentType, Content
from omniread.core.parser import BaseParser
T = TypeVar("T")
class HTMLParser(BaseParser[T], Generic[T]):
supported_types: set[ContentType]
def __init__(self, content: Content, features: str = ...) -> None: ...
def parse(self) -> T: ...
@staticmethod
def parse_div(div: Tag, *, separator: str = ...) -> str: ...
@staticmethod
def parse_link(a: Tag) -> Optional[str]: ...
@staticmethod
def parse_table(table: Tag) -> list[list[str]]: ...
def parse_meta(self) -> dict[str, Any]: ...


@@ -1,27 +1,97 @@
"""
HTML scraping implementation for OmniRead.
This module provides an HTTP-based scraper for retrieving HTML documents.
It implements the core `BaseScraper` contract using `httpx` as the transport
layer.
This scraper is responsible for:
- Fetching raw HTML bytes over HTTP(S)
- Validating response content type
- Attaching HTTP metadata to the returned content
This scraper is not responsible for:
- Parsing or interpreting HTML
- Retrying failed requests
- Managing crawl policies or rate limiting
"""
import httpx
from typing import Any, Mapping, Optional
from omniread.core.content import Content, ContentType
from omniread.core.scraper import BaseScraper
class HTMLScraper(BaseScraper):
"""
Base HTML scraper using httpx.
This scraper retrieves HTML documents over HTTP(S) and returns them
as raw content wrapped in a `Content` object.
Fetches raw bytes and metadata only.
The scraper:
- Uses `httpx.Client` for HTTP requests
- Enforces an HTML content type
- Preserves HTTP response metadata
The scraper does not:
- Parse HTML
- Perform retries or backoff
- Handle non-HTML responses
"""
def __init__(
self,
*,
client: httpx.Client | None = None,
timeout: float = 15.0,
headers: Optional[Mapping[str, str]] = None,
follow_redirects: bool = True,
):
"""
Initialize the HTML scraper.
Args:
client: Optional pre-configured `httpx.Client`. If omitted,
a client is created internally.
timeout: Request timeout in seconds.
headers: Optional default HTTP headers.
follow_redirects: Whether to follow HTTP redirects.
"""
self._client = client or httpx.Client(
timeout=timeout,
headers=headers,
follow_redirects=follow_redirects,
)
self.content_type = ContentType.HTML
def validate_content_type(
self,
response: httpx.Response,
):
"""
Validate that the HTTP response contains HTML content.
Args:
response: HTTP response returned by `httpx`.
Raises:
ValueError: If the `Content-Type` header is missing or does not
indicate HTML content.
"""
raw_ct = response.headers.get("Content-Type")
if not raw_ct:
raise ValueError("Missing Content-Type header")
base_ct = raw_ct.split(";", 1)[0].strip().lower()
if base_ct != self.content_type.value:
raise ValueError(
f"Expected HTML content, got '{raw_ct}'"
)
def fetch(
self,
@@ -29,20 +99,36 @@ class HTMLScraper(BaseScraper):
*,
metadata: Optional[Mapping[str, Any]] = None,
) -> Content:
"""
Fetch an HTML document from the given source.
Args:
source: URL of the HTML document.
metadata: Optional metadata to be merged into the returned content.
Returns:
A `Content` instance containing:
- Raw HTML bytes
- Source URL
- HTML content type
- HTTP response metadata
Raises:
httpx.HTTPError: If the HTTP request fails.
ValueError: If the `Content-Type` header is missing or does not indicate HTML.
"""
response = self._client.get(source)
response.raise_for_status()
self.validate_content_type(response)
return Content(
raw=response.content,
source=source,
content_type=self.content_type,
metadata={
"status_code": response.status_code,
"headers": dict(response.headers),
**(metadata or {}),
},
)
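
Two details of `HTMLScraper` are easy to miss: the `Content-Type` check ignores parameters such as `charset` and is case-insensitive, and caller-supplied metadata is unpacked last, so it overrides the scraper's own `status_code` and `headers` keys on conflict. The sketch below isolates both behaviours as plain functions. `base_media_type` and `merge_metadata` are hypothetical names that mirror the expressions in the diff and are not part of the OmniRead API.

```python
from typing import Any, Dict, Mapping, Optional


def base_media_type(raw_ct: Optional[str]) -> str:
    """Strip parameters such as `charset` and normalize case."""
    if not raw_ct:
        raise ValueError("Missing Content-Type header")
    return raw_ct.split(";", 1)[0].strip().lower()


def merge_metadata(
    status_code: int,
    headers: Mapping[str, str],
    metadata: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Caller-supplied keys are unpacked last, so they win on conflict."""
    return {
        "status_code": status_code,
        "headers": dict(headers),
        **(metadata or {}),
    }


media_type = base_media_type("Text/HTML; charset=UTF-8")
merged = merge_metadata(
    200,
    {"Content-Type": "text/html"},
    {"status_code": "overridden", "job": "nightly"},
)
```

Callers who want to keep the HTTP status intact should therefore avoid reusing the `status_code` and `headers` keys in their own metadata.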

10
omniread/html/scraper.pyi Normal file

@@ -0,0 +1,10 @@
import httpx
from typing import Any, Mapping, Optional
from omniread.core.content import Content, ContentType
from omniread.core.scraper import BaseScraper
class HTMLScraper(BaseScraper):
content_type: ContentType
def __init__(self, *, client: Optional[httpx.Client] = ..., timeout: float = ..., headers: Optional[Mapping[str, str]] = ..., follow_redirects: bool = ...) -> None: ...
def validate_content_type(self, response: httpx.Response) -> None: ...
def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = ...) -> Content: ...

25
omniread/pdf/__init__.py Normal file

@@ -0,0 +1,25 @@
"""
PDF format implementation for OmniRead.
This package provides **PDF-specific implementations** of the core OmniRead
contracts defined in `omniread.core`.
Unlike HTML, PDF handling requires an explicit client layer for document
access. This package therefore includes:
- PDF clients for acquiring raw PDF data
- PDF scrapers that coordinate client access
- PDF parsers that extract structured content from PDF binaries
Public exports from this package represent the supported PDF pipeline
and are safe for consumers to import directly when working with PDFs.
"""
from .client import FileSystemPDFClient
from .scraper import PDFScraper
from .parser import PDFParser
__all__ = [
"FileSystemPDFClient",
"PDFScraper",
"PDFParser",
]


@@ -0,0 +1,5 @@
from .client import FileSystemPDFClient
from .scraper import PDFScraper
from .parser import PDFParser
__all__ = ["FileSystemPDFClient", "PDFScraper", "PDFParser"]

80
omniread/pdf/client.py Normal file

@@ -0,0 +1,80 @@
"""
PDF client abstractions for OmniRead.
This module defines the **client layer** responsible for retrieving raw PDF
bytes from a concrete backing store.
Clients provide low-level access to PDF binaries and are intentionally
decoupled from scraping and parsing logic. They do not perform validation,
interpretation, or content extraction.
Typical backing stores include:
- Local filesystems
- Object storage (S3, GCS, etc.)
- Network file systems
"""
from typing import Any
from abc import ABC, abstractmethod
from pathlib import Path
class BasePDFClient(ABC):
"""
Abstract client responsible for retrieving PDF bytes
from a specific backing store (filesystem, S3, FTP, etc.).
Implementations must:
- Accept a source identifier appropriate to the backing store
- Return the full PDF binary payload
- Raise retrieval-specific errors on failure
"""
@abstractmethod
def fetch(self, source: Any) -> bytes:
"""
Fetch raw PDF bytes from the given source.
Args:
source: Identifier of the PDF location, such as a file path,
object storage key, or remote reference.
Returns:
Raw PDF bytes.
Raises:
Exception: Retrieval-specific errors defined by the implementation.
"""
raise NotImplementedError
class FileSystemPDFClient(BasePDFClient):
"""
PDF client that reads from the local filesystem.
This client reads PDF files directly from the disk and returns their raw
binary contents.
"""
def fetch(self, path: Path) -> bytes:
"""
Read a PDF file from the local filesystem.
Args:
path: Filesystem path to the PDF file.
Returns:
Raw PDF bytes.
Raises:
FileNotFoundError: If the path does not exist.
ValueError: If the path exists but is not a file.
"""
if not path.exists():
raise FileNotFoundError(f"PDF not found: {path}")
if not path.is_file():
raise ValueError(f"Path is not a file: {path}")
return path.read_bytes()
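
`FileSystemPDFClient.fetch` calls `Path` methods on its argument directly, so passing a plain string would fail with an `AttributeError`. As a sketch of one way to tolerate both forms, assuming that is desirable, the hypothetical `PathCoercingPDFClient` below coerces its argument first. The `BasePDFClient` stand-in is included only to keep the snippet self-contained.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Union


class BasePDFClient(ABC):
    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        raise NotImplementedError


class PathCoercingPDFClient(BasePDFClient):
    """Hypothetical variant of the filesystem client that also accepts `str`."""

    def fetch(self, source: Union[str, Path]) -> bytes:
        path = Path(source)
        if not path.exists():
            raise FileNotFoundError(f"PDF not found: {path}")
        if not path.is_file():
            raise ValueError(f"Path is not a file: {path}")
        return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    pdf_path = Path(tmp) / "sample.pdf"
    pdf_path.write_bytes(b"%PDF-1.4\n")
    client = PathCoercingPDFClient()
    from_str = client.fetch(str(pdf_path))
    from_path = client.fetch(pdf_path)
    try:
        client.fetch(tmp)  # a directory exists but is not a file
        directory_rejected = False
    except ValueError:
        directory_rejected = True
```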

10
omniread/pdf/client.pyi Normal file

@@ -0,0 +1,10 @@
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any
class BasePDFClient(ABC):
@abstractmethod
def fetch(self, source: Any) -> bytes: ...
class FileSystemPDFClient(BasePDFClient):
def fetch(self, path: Path) -> bytes: ...

49
omniread/pdf/parser.py Normal file

@@ -0,0 +1,49 @@
"""
PDF parser base implementations for OmniRead.
This module defines the **PDF-specific parser contract**, extending the
format-agnostic `BaseParser` with constraints appropriate for PDF content.
PDF parsers are responsible for interpreting binary PDF data and producing
structured representations suitable for downstream consumption.
"""
from typing import Generic, TypeVar
from abc import abstractmethod
from omniread.core.content import ContentType
from omniread.core.parser import BaseParser
T = TypeVar("T")
class PDFParser(BaseParser[T], Generic[T]):
"""
Base PDF parser.
This class enforces PDF content-type compatibility and provides the
extension point for implementing concrete PDF parsing strategies.
Concrete implementations must:
- Define the output type `T`
- Implement the `parse()` method
"""
supported_types = {ContentType.PDF}
"""Set of content types supported by this parser (PDF only)."""
@abstractmethod
def parse(self) -> T:
"""
Parse PDF content into a structured output.
Implementations must fully interpret the PDF binary payload and
return a deterministic, structured output.
Returns:
Parsed representation of type `T`.
Raises:
Exception: Parsing-specific errors as defined by the implementation.
"""
raise NotImplementedError
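
A concrete `PDFParser` subclass only needs to fix the output type `T` and implement `parse()`. The sketch below uses a deliberately trivial strategy, reading the version from the `%PDF-x.y` header, to show the shape of such a subclass. `PDFVersionParser` is hypothetical, and the stand-in base classes assume a membership-based content-type check.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, Optional, Set, TypeVar

T = TypeVar("T")


class ContentType(str, Enum):
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None


class BaseParser(ABC, Generic[T]):
    # Stand-in for the core base class (membership check is an assumption).
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content):
        self.content = content
        if self.supported_types and content.content_type not in self.supported_types:
            raise ValueError(f"Unsupported content type: {content.content_type}")

    @abstractmethod
    def parse(self) -> T:
        raise NotImplementedError


class PDFParser(BaseParser[T], Generic[T]):
    supported_types = {ContentType.PDF}


class PDFVersionParser(PDFParser[str]):
    """Hypothetical parser that reads the version from the file header."""

    _HEADER = re.compile(rb"%PDF-(\d+\.\d+)")

    def parse(self) -> str:
        match = self._HEADER.match(self.content.raw)
        if match is None:
            raise ValueError(f"No PDF header found in {self.content.source!r}")
        return match.group(1).decode("ascii")


pdf = Content(raw=b"%PDF-1.4\n1 0 obj\n", source="simple.pdf", content_type=ContentType.PDF)
version = PDFVersionParser(pdf).parse()
```

Real text extraction would typically delegate to a PDF library inside `parse()`; the base class deliberately leaves that choice to the subclass.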

11
omniread/pdf/parser.pyi Normal file

@@ -0,0 +1,11 @@
from abc import abstractmethod
from typing import Generic, TypeVar
from omniread.core.content import ContentType
from omniread.core.parser import BaseParser
T = TypeVar("T")
class PDFParser(BaseParser[T], Generic[T]):
supported_types: set[ContentType]
@abstractmethod
def parse(self) -> T: ...

71
omniread/pdf/scraper.py Normal file

@@ -0,0 +1,71 @@
"""
PDF scraping implementation for OmniRead.
This module provides a PDF-specific scraper that coordinates PDF byte
retrieval via a client and normalizes the result into a `Content` object.
The scraper implements the core `BaseScraper` contract while delegating
all storage and access concerns to a `BasePDFClient` implementation.
"""
from typing import Any, Mapping, Optional
from omniread.core.content import Content, ContentType
from omniread.core.scraper import BaseScraper
from omniread.pdf.client import BasePDFClient
class PDFScraper(BaseScraper):
"""
Scraper for PDF sources.
Delegates byte retrieval to a PDF client and normalizes
output into Content.
The scraper:
- Does not perform parsing or interpretation
- Does not assume a specific storage backend
- Preserves caller-provided metadata
"""
def __init__(self, *, client: BasePDFClient):
"""
Initialize the PDF scraper.
Args:
client: PDF client responsible for retrieving raw PDF bytes.
"""
self._client = client
def fetch(
self,
source: Any,
*,
metadata: Optional[Mapping[str, Any]] = None,
) -> Content:
"""
Fetch a PDF document from the given source.
Args:
source: Identifier of the PDF source as understood by the
configured PDF client.
metadata: Optional metadata to attach to the returned content.
Returns:
A `Content` instance containing:
- Raw PDF bytes
- Source identifier
- PDF content type
- Optional metadata
Raises:
Exception: Retrieval-specific errors raised by the PDF client.
"""
raw = self._client.fetch(source)
return Content(
raw=raw,
source=source,
content_type=ContentType.PDF,
metadata=dict(metadata) if metadata else None,
)
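
Because `PDFScraper` depends only on the `BasePDFClient` interface, any backing store can be plugged in. The sketch below pairs a stand-in scraper with a hypothetical dict-backed client (`DictPDFClient`). It also shows that metadata is copied with `dict(...)`, so later changes to the caller's mapping do not leak into the returned content, and that an empty mapping is stored as `None`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional

PDF_MIME = "application/pdf"


@dataclass
class Content:
    raw: bytes
    source: Any
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


class BasePDFClient(ABC):
    @abstractmethod
    def fetch(self, source: Any) -> bytes:
        raise NotImplementedError


class DictPDFClient(BasePDFClient):
    """Hypothetical client that serves PDF bytes from an in-memory dict."""

    def __init__(self, blobs: Dict[str, bytes]):
        self._blobs = blobs

    def fetch(self, source: Any) -> bytes:
        if source not in self._blobs:
            raise FileNotFoundError(f"No PDF registered for {source!r}")
        return self._blobs[source]


class PDFScraper:
    """Stand-in mirroring the delegation shown in the diff."""

    def __init__(self, *, client: BasePDFClient):
        self._client = client

    def fetch(self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        raw = self._client.fetch(source)
        return Content(
            raw=raw,
            source=source,
            content_type=PDF_MIME,
            metadata=dict(metadata) if metadata else None,
        )


scraper = PDFScraper(client=DictPDFClient({"report": b"%PDF-1.4\n"}))
caller_meta = {"tag": "before"}
content = scraper.fetch("report", metadata=caller_meta)
caller_meta["tag"] = "after"  # mutating the caller's dict does not touch the copy
without_meta = scraper.fetch("report", metadata={})
```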

8
omniread/pdf/scraper.pyi Normal file

@@ -0,0 +1,8 @@
from typing import Any, Mapping, Optional
from omniread.core.content import Content, ContentType
from omniread.core.scraper import BaseScraper
from .client import BasePDFClient
class PDFScraper(BaseScraper):
def __init__(self, *, client: BasePDFClient) -> None: ...
def fetch(self, source: Any, *, metadata: Optional[Mapping[str, Any]] = ...) -> Content: ...


@@ -1,7 +0,0 @@
httpx==0.27.0
beautifulsoup4==4.12.0
# lxml==5.2.0
pytest==7.4.0
pytest-asyncio==0.21.0
pytest-cov==4.1.0

0
tests/__init__.py Normal file

90
tests/conftest.py Normal file

@@ -0,0 +1,90 @@
import json
import pytest
import httpx
from pathlib import Path

from jinja2 import Environment, BaseLoader

from omniread import (
    # core
    ContentType,
    # html
    HTMLScraper,
    # pdf
    FileSystemPDFClient,
    PDFScraper,
)


MOCK_HTML_DIR = Path(__file__).parent / "mocks" / "html"
MOCK_PDF_DIR = Path(__file__).parent / "mocks" / "pdf"


def render_html(template_path, data_path) -> bytes:
    template_text = Path(template_path).read_text(encoding="utf-8")
    data = json.loads(Path(data_path).read_text(encoding="utf-8"))

    env = Environment(
        loader=BaseLoader(),
        autoescape=False,
    )
    template = env.from_string(template_text)
    rendered = template.render(**data)
    return rendered.encode("utf-8")


def mock_transport(request: httpx.Request) -> httpx.Response:
    """
    httpx MockTransport handler.
    """
    path = request.url.path
    if path not in ['/simple', '/table']:
        return httpx.Response(
            status_code=404,
            content=b"Not Found",
            request=request,
        )

    endpoint = path.split("/")[-1]
    content = render_html(
        MOCK_HTML_DIR / f"{endpoint}.html.jinja",
        MOCK_HTML_DIR / f"{endpoint}.json",
    )

    return httpx.Response(
        status_code=200,
        headers={"Content-Type": ContentType.HTML.value},
        content=content,
        request=request,
    )


@pytest.fixture
def http_scraper() -> HTMLScraper:
    transport = httpx.MockTransport(mock_transport)
    client = httpx.Client(transport=transport)
    return HTMLScraper(client=client)


class MockPDFClient(FileSystemPDFClient):
    """
    Test-only PDF client that routes logical identifiers
    to fixture files.
    """

    def fetch(self, source: str) -> bytes:
        if source in ["simple"]:
            source = MOCK_PDF_DIR / f"{source}.pdf"
        else:
            raise FileNotFoundError(f"No mock PDF route for '{source}'")
        return super().fetch(source)


@pytest.fixture
def pdf_scraper() -> PDFScraper:
    client = MockPDFClient()
    return PDFScraper(client=client)

@@ -0,0 +1,11 @@
<!DOCTYPE html>
<html>
<head>
  <title>{{ title }}</title>
  <meta name="description" content="{{ description }}">
</head>
<body>
  <div id="content">{{ content }}</div>
  <a href="{{ link_url }}">{{ link_text }}</a>
</body>
</html>

@@ -0,0 +1,7 @@
{
  "title": "Test Page",
  "description": "Simple test page",
  "content": "Hello World",
  "link_url": "https://example.com",
  "link_text": "Link"
}

@@ -0,0 +1,31 @@
<!DOCTYPE html>
<html>
<head>
  <title>{{ title }}</title>
  <meta name="description" content="{{ description }}">
</head>
<body>
  <h1>{{ heading }}</h1>
  <table id="{{ table_id }}">
    <thead>
      <tr>
        {% for col in columns %}
        <th>{{ col }}</th>
        {% endfor %}
      </tr>
    </thead>
    <tbody>
      {% for row in rows %}
      <tr>
        {% for cell in row %}
        <td>{{ cell }}</td>
        {% endfor %}
      </tr>
      {% endfor %}
    </tbody>
  </table>
  <a href="{{ link_url }}">{{ link_text }}</a>
</body>
</html>

@@ -0,0 +1,14 @@
{
  "title": "Table Test Page",
  "description": "HTML page with a table for parsing tests",
  "heading": "Sample Table",
  "table_id": "data-table",
  "columns": ["Name", "Age", "City"],
  "rows": [
    ["Alice", "30", "London"],
    ["Bob", "25", "New York"],
    ["Charlie", "35", "Berlin"]
  ],
  "link_url": "https://example.org/details",
  "link_text": "Details"
}

@@ -0,0 +1,32 @@
%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>
endobj
4 0 obj
<< /Length 44 >>
stream
BT
/F1 12 Tf
72 720 Td
(Simple PDF Test) Tj
ET
endstream
endobj
xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000061 00000 n
0000000116 00000 n
0000000203 00000 n
trailer
<< /Size 5 /Root 1 0 R >>
startxref
300
%%EOF
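
The fixture above carries a cross-reference table and a `startxref` value that look hand-written. The tests in this change only check the `%PDF` prefix and the byte length, so that is harmless here, but a stricter PDF reader may care about exact offsets. The following is a minimal sketch, using only the standard library, of how an equivalent one-page fixture could be generated with offsets computed from the actual bytes. The function name `build_minimal_pdf` is hypothetical and is not part of omniread or this change.

```python
def build_minimal_pdf(text: str = "Simple PDF Test") -> bytes:
    """Assemble a one-page PDF and compute xref offsets from the real bytes.

    Hypothetical helper for generating a test fixture; not part of omniread.
    """
    # Content stream mirroring the fixture: one line of text at (72, 720).
    stream = f"BT\n/F1 12 Tf\n72 720 Td\n({text}) Tj\nET\n".encode("ascii")

    # Object bodies, numbered 1..4 in the same order as the fixture.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"endstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)  # byte offset of the "xref" keyword
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the result to a file such as `simple.pdf` under the mock PDF directory would give the same logical content as the fixture, with `startxref` pointing at the `xref` keyword.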

53
tests/test_html_simple.py Normal file
@@ -0,0 +1,53 @@
from typing import Optional

from pydantic import BaseModel
from bs4 import Tag

from omniread import (
    # core
    Content,
    # html
    HTMLParser,
)


class ParsedSimpleHTML(BaseModel):
    title: Optional[str]
    description: Optional[str]
    content: Optional[str]
    link: Optional[str]


class SimpleHTMLParser(HTMLParser[ParsedSimpleHTML]):
    """
    Parser focused on high-level page semantics.
    """

    def parse(self) -> ParsedSimpleHTML:
        soup = self._soup
        meta = self.parse_meta()

        content_div = soup.find("div", id="content")
        link_tag: Tag | None = soup.find("a")

        return ParsedSimpleHTML(
            title=meta["title"],
            description=meta["meta"].get("description"),
            content=self.parse_div(content_div) if content_div else None,
            link=self.parse_link(link_tag) if link_tag else None,
        )


def test_end_to_end_html_simple(http_scraper):
    content: Content = http_scraper.fetch("https://test.local/simple")

    parser = SimpleHTMLParser(content)
    result = parser.parse()

    assert isinstance(result, ParsedSimpleHTML)
    assert result.title == "Test Page"
    assert result.description == "Simple test page"
    assert result.content == "Hello World"
    assert result.link == "https://example.com"

49
tests/test_html_table.py Normal file
@@ -0,0 +1,49 @@
from typing import Optional

from pydantic import BaseModel

from omniread import (
    # core
    Content,
    # html
    HTMLParser,
)


class ParsedTableHTML(BaseModel):
    title: Optional[str]
    table: list[list[str]]


class TableHTMLParser(HTMLParser[ParsedTableHTML]):
    """
    Parser focused on extracting tabular data.
    """

    def parse(self) -> ParsedTableHTML:
        soup = self._soup
        table_tag = soup.find("table")

        # soup.title.string is None for an empty or nested <title>,
        # so guard it before calling strip().
        title_text = soup.title.string if soup.title else None

        return ParsedTableHTML(
            title=title_text.strip() if title_text else None,
            table=self.parse_table(table_tag) if table_tag else [],
        )


def test_end_to_end_html_table(http_scraper):
    content: Content = http_scraper.fetch("https://test.local/table")

    parser = TableHTMLParser(content)
    result = parser.parse()

    assert isinstance(result, ParsedTableHTML)
    assert result.title == "Table Test Page"
    assert result.table == [
        ["Name", "Age", "City"],
        ["Alice", "30", "London"],
        ["Bob", "25", "New York"],
        ["Charlie", "35", "Berlin"],
    ]
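
The assertion above expects the header row first, followed by the body rows, each as a list of stripped cell strings. How `parse_table` produces that shape is not shown in this diff. As an illustration only, here is a standard-library sketch that yields the same row-of-cells structure; the names `TableRows` and `extract_rows` are hypothetical and this is not omniread's implementation.

```python
from html.parser import HTMLParser as StdHTMLParser


class TableRows(StdHTMLParser):
    """Collect every <tr> as a list of stripped <th>/<td> cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html: str) -> list[list[str]]:
    """Return header and body rows in document order."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Because header cells (`th`) and data cells (`td`) are treated alike, the header row naturally ends up as the first entry, which matches the layout the test compares against.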

41
tests/test_pdf_simple.py Normal file
@@ -0,0 +1,41 @@
from typing import Literal

from pydantic import BaseModel

from omniread import (
    # core
    Content,
    # pdf
    PDFParser,
)


class ParsedPDF(BaseModel):
    size_bytes: int
    magic: Literal[b"%PDF"]


class SimplePDFParser(PDFParser[ParsedPDF]):
    def parse(self) -> ParsedPDF:
        raw = self.content.raw

        if not raw.startswith(b"%PDF"):
            raise ValueError("Not a valid PDF")

        return ParsedPDF(
            size_bytes=len(raw),
            magic=b"%PDF",
        )


def test_end_to_end_pdf_simple(pdf_scraper):
    # --- Scrape (identifier-based, routed in conftest)
    content: Content = pdf_scraper.fetch("simple")
    assert content.raw.startswith(b"%PDF")

    # --- Parse
    parser = SimplePDFParser(content)
    result = parser.parse()

    assert result.magic == b"%PDF"
    assert result.size_bytes > 100