omniread

Summary

OmniRead — format-agnostic content acquisition and parsing framework.

OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.
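
The three roles can be sketched in plain Python. This is a minimal illustration, not OmniRead's actual classes: `SketchContent`, `InMemoryScraper`, `SketchParser`, and `WordCountParser` are hypothetical stand-ins, and the field names (`raw`, `source`, `content_type`) are assumptions about what "raw content bytes and minimal contextual metadata" might look like.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical container: raw bytes plus minimal metadata (assumed fields)."""
    raw: bytes
    source: str
    content_type: str


class InMemoryScraper:
    """Hypothetical scraper: acquires bytes and never interprets them."""

    def __init__(self, documents: dict[str, bytes]) -> None:
        self._documents = documents

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(
            raw=self._documents[source],
            source=source,
            content_type="text/plain",
        )


class SketchParser(Generic[T]):
    """Hypothetical parser base: wraps content; subclasses return a typed result."""

    def __init__(self, content: SketchContent) -> None:
        self.content = content

    def parse(self) -> T:
        raise NotImplementedError


class WordCountParser(SketchParser[int]):
    """Interprets the bytes as UTF-8 text and counts whitespace-separated words."""

    def parse(self) -> int:
        return len(self.content.raw.decode("utf-8").split())


scraper = InMemoryScraper({"memo.txt": b"fetch then parse"})
content = scraper.fetch("memo.txt")
word_count = WordCountParser(content).parse()
```

The scraper only moves bytes into the container, and the parser only reads from it, which mirrors the fetch-then-parse shape used in the Quick start examples.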

OmniRead deliberately separates these responsibilities to ensure:

Clear boundaries between IO and interpretation.
Replaceable implementations per format.
Predictable, testable behavior.
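
One practical consequence is that interpretation can be tested without any IO. The sketch below assumes a hypothetical `FakeContent` container and a function-style parser. Neither is part of OmniRead; they only show how a hand-built content object or a stub fetcher can replace a real source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FakeContent:
    """Hypothetical stand-in for the content container."""
    raw: bytes
    source: str


def parse_first_line(content: FakeContent) -> str:
    """Interpretation only: no network or filesystem access."""
    text = content.raw.decode("utf-8")
    return text.splitlines()[0].strip() if text else ""


def run_pipeline(fetch: Callable[[str], FakeContent], source: str) -> str:
    """Glue code: any fetch callable with this shape can be swapped in."""
    return parse_first_line(fetch(source))


def stub_fetch(source: str) -> FakeContent:
    """Stub that replaces real IO with fixed bytes."""
    return FakeContent(raw=b"  Stubbed heading  \nrest", source=source)


# The parser is exercised with hand-built content, so no IO is involved.
direct = parse_first_line(FakeContent(raw=b"Title\nBody", source="inline"))

# The same parser runs behind a stubbed acquisition step.
heading = run_pipeline(stub_fetch, "https://example.invalid/page")
```

Because the parser depends only on the content object, replacing the fetcher does not require touching the parsing code.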

Installation

Install OmniRead using pip:

pip install omniread

Install OmniRead using Poetry:

poetry add omniread

Quick start

HTML example:
    ```python
    from omniread import HTMLScraper, HTMLParser

    scraper = HTMLScraper()
    content = scraper.fetch("https://example.com")

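    class TitleParser(HTMLParser[str]):
        def parse(self) -> str:
            # Assumes BeautifulSoup semantics: .title or .string may be None.
            title = self._soup.title
            return (title.string or "") if title is not None else ""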

    parser = TitleParser(content)
    title = parser.parse()
    ```

PDF example:
    ```python
    from omniread import FileSystemPDFClient, PDFScraper, PDFParser
    from pathlib import Path

    client = FileSystemPDFClient()
    scraper = PDFScraper(client=client)
    content = scraper.fetch(Path("document.pdf"))

    class TextPDFParser(PDFParser[str]):
        def parse(self) -> str:
            # implement PDF text extraction
            ...

    parser = TextPDFParser(content)
    result = parser.parse()
    ```

Public API

This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

Content: Canonical content model.
ContentType: Supported media types.
HTMLScraper: HTTP-based HTML acquisition.
HTMLParser: Base parser for HTML DOM interpretation.
FileSystemPDFClient: Local filesystem PDF access.
PDFScraper: PDF-specific content acquisition.
PDFParser: Base parser for PDF binary interpretation.

Core Philosophy

OmniRead is designed as a decoupled content engine:

Separation of Concerns: Scrapers fetch and Parsers interpret; neither knows about the other.
Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.
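
To illustrate format agnosticism, the sketch below routes content to an interpreter purely by its declared media type. The `Envelope` class, the `PARSERS` registry, and `interpret` are hypothetical names introduced for this example; OmniRead's own dispatch mechanism, if it has one, may look different.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Envelope:
    """Hypothetical stand-in for the shared content model."""
    raw: bytes
    content_type: str


# Hypothetical registry mapping a media type to an interpreter.
PARSERS: dict[str, Callable[[Envelope], object]] = {
    "application/json": lambda c: json.loads(c.raw),
    "text/plain": lambda c: c.raw.decode("utf-8"),
}


def interpret(content: Envelope) -> object:
    """Core logic: looks up a parser by media type and knows nothing else about formats."""
    try:
        parser = PARSERS[content.content_type]
    except KeyError:
        raise ValueError(
            f"no parser registered for {content.content_type!r}"
        ) from None
    return parser(content)


parsed_json = interpret(Envelope(b'{"pages": 2}', "application/json"))
parsed_text = interpret(Envelope(b"hello", "text/plain"))
```

Supporting a new format in this sketch means registering one more entry; `interpret` itself does not change.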
