omniread
Summary
OmniRead — format-agnostic content acquisition and parsing framework.
OmniRead provides a cleanly layered architecture for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.
The library is structured around three core concepts:
- Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
- Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
- Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation.
- Replaceable implementations per format.
- Predictable, testable behavior.
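The boundary between IO and interpretation can be sketched in plain Python. This is an illustrative sketch only, not OmniRead's actual classes: `SketchContent`, `InMemoryScraper`, `SketchParser`, `WordCountParser`, and the field names `raw`, `source`, and `content_type` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for a content container: raw bytes plus minimal metadata."""

    raw: bytes
    source: str
    content_type: str


class InMemoryScraper:
    """Hypothetical scraper backed by a dict, standing in for HTTP or filesystem access."""

    def __init__(self, store: dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str) -> SketchContent:
        # Acquire bytes and wrap them; no decoding or interpretation happens here.
        return SketchContent(raw=self._store[source], source=source, content_type="text/plain")


class SketchParser(Generic[T]):
    """Hypothetical parser base: interprets a SketchContent and performs no IO."""

    def __init__(self, content: SketchContent) -> None:
        self.content = content

    def parse(self) -> T:
        raise NotImplementedError


class WordCountParser(SketchParser[int]):
    def parse(self) -> int:
        # Interpretation lives entirely in the parser.
        return len(self.content.raw.decode("utf-8").split())


scraper = InMemoryScraper({"memo": b"scrapers fetch and parsers interpret"})
content = scraper.fetch("memo")
word_count = WordCountParser(content).parse()
print(word_count)
```

Because the parser depends only on the content container, it can be tested with a hand-built `SketchContent` and no scraper at all, which is the kind of testability the separation aims for.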
Installation
Install OmniRead using pip:
pip install omniread
Install OmniRead using Poetry:
poetry add omniread
Quick start
HTML example:
```python
from omniread import HTMLScraper, HTMLParser

# Acquire the raw content; the scraper does not interpret it.
scraper = HTMLScraper()
content = scraper.fetch("https://example.com")


class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        # Read the text of the <title> element from the parsed document.
        return self._soup.title.string


parser = TitleParser(content)
title = parser.parse()
```
PDF example:
```python
from pathlib import Path

from omniread import FileSystemPDFClient, PDFScraper, PDFParser

# Acquire the raw PDF bytes from the local filesystem.
client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))


class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...


parser = TextPDFParser(content)
result = parser.parse()
```
Public API
This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.
- Content: Canonical content model.
- ContentType: Supported media types.
- HTMLScraper: HTTP-based HTML acquisition.
- HTMLParser: Base parser for HTML DOM interpretation.
- FileSystemPDFClient: Local filesystem PDF access.
- PDFScraper: PDF-specific content acquisition.
- PDFParser: Base parser for PDF binary interpretation.
Core Philosophy
OmniRead is designed as a decoupled content engine:
- Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
- Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
- Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.
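One way to picture format agnosticism is a core function that selects an interpreter by media type and never branches on a specific format itself. The sketch below is an assumption made for illustration, not a description of how OmniRead dispatches internally: `SketchContent`, `PARSERS`, and `interpret` are hypothetical names, and the two registered formats were picked only because the standard library can handle them.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical normalized container exchanged between components."""

    raw: bytes
    content_type: str


# Hypothetical registry: maps a media type to a function that interprets the bytes.
PARSERS: dict[str, Callable[[SketchContent], Any]] = {
    "application/json": lambda c: json.loads(c.raw.decode("utf-8")),
    "text/plain": lambda c: c.raw.decode("utf-8").splitlines(),
}


def interpret(content: SketchContent) -> Any:
    """Select a parser by content type; the core holds no format-specific logic."""
    try:
        parser = PARSERS[content.content_type]
    except KeyError:
        raise ValueError(f"no parser registered for {content.content_type!r}") from None
    return parser(content)


json_result = interpret(SketchContent(b'{"title": "Example"}', "application/json"))
text_result = interpret(SketchContent(b"first\nsecond", "text/plain"))
print(json_result, text_result)
```

Supporting another format in this sketch means registering one more entry in `PARSERS`; `interpret` and the container stay unchanged.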