Reviewed-on: #1 Co-authored-by: Vishesh 'ironeagle' Bangotra <aetoskia@gmail.com> Co-committed-by: Vishesh 'ironeagle' Bangotra <aetoskia@gmail.com>
138 lines
3.9 KiB
Python
"""
OmniRead: format-agnostic content acquisition and parsing framework.

OmniRead provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.

The library is structured around three core concepts:

1. **Content**
   A canonical, format-agnostic container representing raw content bytes
   and minimal contextual metadata.

2. **Scrapers**
   Components responsible for *acquiring* raw content from a source
   (HTTP, filesystem, object storage, etc.). Scrapers never interpret
   content.

3. **Parsers**
   Components responsible for *interpreting* acquired content and
   converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior

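The split between acquisition and interpretation can be sketched as follows.
This is a minimal illustration, not OmniRead's implementation:
`SketchContent`, `InMemoryScraper`, and `Utf8Parser` are hypothetical
stand-ins, and the real `Content` fields and scraper signatures may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class SketchContent:
    # Hypothetical stand-in for Content: raw bytes plus minimal metadata.
    raw: bytes
    source: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class InMemoryScraper:
    # Acquires bytes only; it never decodes or inspects them.
    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(raw=self._store[source], source=source)


class Utf8Parser:
    # Interprets already-acquired content only; it performs no IO.
    def __init__(self, content: SketchContent) -> None:
        self._content = content

    def parse(self) -> str:
        return self._content.raw.decode("utf-8").strip()


scraper = InMemoryScraper({"memo.txt": b"  hello  "})
content = scraper.fetch("memo.txt")
text = Utf8Parser(content).parse()
```
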
----------------------------------------------------------------------
Installation
----------------------------------------------------------------------

Install OmniRead using pip:

    pip install omniread

Or with Poetry:

    poetry add omniread

----------------------------------------------------------------------
Basic Usage
----------------------------------------------------------------------

HTML example:

    from omniread import HTMLScraper, HTMLParser

    scraper = HTMLScraper()
    content = scraper.fetch("https://example.com")

    class TitleParser(HTMLParser[str]):
        def parse(self) -> str:
            return self._soup.title.string

    parser = TitleParser(content)
    title = parser.parse()

PDF example:

    from omniread import FileSystemPDFClient, PDFScraper, PDFParser
    from pathlib import Path

    client = FileSystemPDFClient()
    scraper = PDFScraper(client=client)
    content = scraper.fetch(Path("document.pdf"))

    class TextPDFParser(PDFParser[str]):
        def parse(self) -> str:
            # implement PDF text extraction
            ...

    parser = TextPDFParser(content)
    result = parser.parse()

----------------------------------------------------------------------
Public API Surface
----------------------------------------------------------------------

This module re-exports the **recommended public entry points** of OmniRead.

Consumers are encouraged to import from this namespace rather than from
format-specific submodules directly, unless advanced customization is
required.

Core:
- Content
- ContentType

HTML:
- HTMLScraper
- HTMLParser

PDF:
- FileSystemPDFClient
- PDFScraper
- PDFParser

## Core Philosophy

`OmniRead` is designed as a **decoupled content engine**:

1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model, ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input is HTML, PDF, or JSON.

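One way to picture the normalized exchange is a dispatch table keyed on
content type. The sketch below is an illustration under assumptions:
`SketchContentType`, `SketchContent`, and `interpret` are hypothetical names,
and the members of the real `ContentType` enum are not shown in this module.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class SketchContentType(Enum):
    # Hypothetical members; the real ContentType enum may differ.
    JSON = "application/json"
    TEXT = "text/plain"


@dataclass(frozen=True)
class SketchContent:
    # The only object passed between acquisition and interpretation.
    raw: bytes
    content_type: SketchContentType


def parse_json(content: SketchContent) -> Any:
    return json.loads(content.raw.decode("utf-8"))


def parse_text(content: SketchContent) -> str:
    return content.raw.decode("utf-8")


# The dispatch table is the only place that names concrete formats.
PARSERS: Dict[SketchContentType, Callable[[SketchContent], Any]] = {
    SketchContentType.JSON: parse_json,
    SketchContentType.TEXT: parse_text,
}


def interpret(content: SketchContent) -> Any:
    # Core logic: look up a parser by type; no format-specific branches here.
    return PARSERS[content.content_type](content)


doc = interpret(SketchContent(b'{"title": "OmniRead"}', SketchContentType.JSON))
note = interpret(SketchContent(b"plain note", SketchContentType.TEXT))
```
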
## Documentation Design

For those extending `OmniRead`, follow these "AI-Native" docstring principles:

### For Humans
- **Clear Contracts**: Explicitly state what a component is and is NOT responsible for.
- **Runnable Examples**: Include small, logical snippets in the package `__init__.py`.

### For LLMs
- **Structured Models**: Use dataclasses and enums for core data to ensure clean MCP JSON representation.
- **Type Safety**: All public APIs must be fully typed and have corresponding `.pyi` stubs.
- **Detailed Raises**: Include `ExceptionType: description` pairs in the `Raises` section to help agents handle errors gracefully.

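A small sketch of these principles applied to one function. All names
(`SketchFormat`, `SketchResult`, `measure`) are hypothetical, and the snippet
uses single-quoted docstring delimiters only so that it can sit inside this
module docstring.

```python
from dataclasses import asdict, dataclass
from enum import Enum


class SketchFormat(Enum):
    # Enum values are plain strings so they serialize cleanly.
    HTML = "html"
    PDF = "pdf"


@dataclass(frozen=True)
class SketchResult:
    format: SketchFormat
    length: int


def measure(raw: bytes, fmt: SketchFormat) -> SketchResult:
    '''Measure a raw payload.

    Responsible for counting bytes; NOT responsible for decoding them.

    Raises:
        ValueError: If `raw` is empty.
    '''
    if not raw:
        raise ValueError("raw payload is empty")
    return SketchResult(format=fmt, length=len(raw))


result = measure(b"<html></html>", SketchFormat.HTML)
payload = asdict(result)
# asdict leaves the enum member in place; swap in its JSON-safe value.
payload["format"] = result.format.value
```
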
"""

from .core import Content, ContentType
from .html import HTMLScraper, HTMLParser
from .pdf import FileSystemPDFClient, PDFScraper, PDFParser

__all__ = [
    # core
    "Content",
    "ContentType",

    # html
    "HTMLScraper",
    "HTMLParser",

    # pdf
    "FileSystemPDFClient",
    "PDFScraper",
    "PDFParser",
]