omniread/omniread/__init__.py
Vishesh 'ironeagle' Bangotra 67a3074ab4 using doc-forge (#1)
Reviewed-on: #1
Co-authored-by: Vishesh 'ironeagle' Bangotra <aetoskia@gmail.com>
Co-committed-by: Vishesh 'ironeagle' Bangotra <aetoskia@gmail.com>
2026-01-22 11:27:56 +00:00

"""
OmniRead — format-agnostic content acquisition and parsing framework.
OmniRead provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.
The library is structured around three core concepts:
1. **Content**
   A canonical, format-agnostic container representing raw content bytes
   and minimal contextual metadata.
2. **Scrapers**
   Components responsible for *acquiring* raw content from a source
   (HTTP, filesystem, object storage, etc.). Scrapers never interpret
   content.
3. **Parsers**
   Components responsible for *interpreting* acquired content and
   converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior
----------------------------------------------------------------------
Installation
----------------------------------------------------------------------
Install OmniRead using pip:

    pip install omniread

Or with Poetry:

    poetry add omniread
----------------------------------------------------------------------
Basic Usage
----------------------------------------------------------------------
HTML example:

    from omniread import HTMLScraper, HTMLParser

    scraper = HTMLScraper()
    content = scraper.fetch("https://example.com")

    class TitleParser(HTMLParser[str]):
        def parse(self) -> str:
            return self._soup.title.string

    parser = TitleParser(content)
    title = parser.parse()
PDF example:

    from pathlib import Path

    from omniread import FileSystemPDFClient, PDFScraper, PDFParser

    client = FileSystemPDFClient()
    scraper = PDFScraper(client=client)
    content = scraper.fetch(Path("document.pdf"))

    class TextPDFParser(PDFParser[str]):
        def parse(self) -> str:
            # implement PDF text extraction
            ...

    parser = TextPDFParser(content)
    result = parser.parse()
----------------------------------------------------------------------
Public API Surface
----------------------------------------------------------------------
This module re-exports the **recommended public entry points** of OmniRead.
Consumers are encouraged to import from this namespace rather than from
format-specific submodules directly, unless advanced customization is
required.
Core:
- Content
- ContentType
HTML:
- HTMLScraper
- HTMLParser
PDF:
- FileSystemPDFClient
- PDFScraper
- PDFParser
## Core Philosophy
`OmniRead` is designed as a **decoupled content engine**:
1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model, ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input is HTML, PDF, or JSON.
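The three principles above can be illustrated with a minimal, standard-library-only sketch. The names `SketchContent`, `InMemoryScraper`, and `WordCountParser` are hypothetical stand-ins invented for this illustration. They are not OmniRead's actual classes, and the real `Content` model may carry different fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class SketchContent:
    # Hypothetical stand-in for Content: raw bytes plus minimal metadata.
    raw: bytes
    source: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class InMemoryScraper:
    # Acquires bytes only; it never inspects or interprets them.
    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(raw=self._store[source], source=source)


class WordCountParser:
    # Interprets content; it has no knowledge of which scraper produced it.
    def __init__(self, content: SketchContent) -> None:
        self._content = content

    def parse(self) -> int:
        return len(self._content.raw.decode("utf-8").split())


scraper = InMemoryScraper({"memory://greeting": b"hello decoupled world"})
content = scraper.fetch("memory://greeting")
word_count = WordCountParser(content).parse()
```

Because the only shared type is the content container, either side can be replaced (for example, a scraper backed by HTTP instead of a dictionary) without touching the other.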
## Documentation Design
For those extending `OmniRead`, follow these "AI-Native" docstring principles:
### For Humans
- **Clear Contracts**: Explicitly state what a component is and is NOT responsible for.
- **Runnable Examples**: Include small, logical snippets in the package `__init__.py`.
### For LLMs
- **Structured Models**: Use dataclasses and enums for core data to ensure clean MCP JSON representation.
- **Type Safety**: All public APIs must be fully typed and have corresponding `.pyi` stubs.
- **Detailed Raises**: Include `ExceptionType: description` pairs in the `Raises` section to help agents handle errors gracefully.
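As a rough sketch of these principles, the snippet below combines an enum, a dataclass, full type hints, and a `Raises` section. All names (`SketchContentType`, `SketchSummary`, `summarize`) are hypothetical and exist only for this illustration. The Google-style `Raises` layout is an assumption about the intended docstring format, and the inner docstring uses single-quote delimiters only so that it can sit inside this module docstring.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class SketchContentType(str, Enum):
    # Hypothetical enum; the str mixin lets members serialize as plain JSON strings.
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class SketchSummary:
    # Hypothetical structured result that a parser might return.
    source: str
    content_type: SketchContentType
    size_bytes: int


def summarize(source: str, raw: bytes, content_type: SketchContentType) -> SketchSummary:
    '''Build a summary of acquired content.

    Raises:
        ValueError: If raw is empty, since there is nothing to summarize.
    '''
    if not raw:
        raise ValueError("raw content is empty")
    return SketchSummary(source=source, content_type=content_type, size_bytes=len(raw))


summary = summarize("document.pdf", b"%PDF-1.7", SketchContentType.PDF)
payload = json.dumps(asdict(summary))
```

Keeping results in dataclasses and enums means `asdict` plus `json.dumps` yields a flat, predictable JSON object without custom encoders.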
"""
from .core import Content, ContentType
from .html import HTMLScraper, HTMLParser
from .pdf import FileSystemPDFClient, PDFScraper, PDFParser
__all__ = [
# core
"Content",
"ContentType",
# html
"HTMLScraper",
"HTMLParser",
# pdf
"FileSystemPDFClient",
"PDFScraper",
"PDFParser",
]