omniread

Summary

OmniRead — format-agnostic content acquisition and parsing framework.

OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

  1. Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
  2. Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
  3. Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure:

  • Clear boundaries between IO and interpretation.
  • Replaceable implementations per format.
  • Predictable, testable behavior.
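
The division of labor above can be sketched with standard-library code only. Every name in this block (`SketchContent`, `InMemoryScraper`, `SketchParser`, `WordCountParser`, the `memory://` source string) is a hypothetical stand-in invented for illustration; none of them is OmniRead's actual class or signature.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for Content: raw bytes plus minimal metadata."""
    raw: bytes
    source: str
    content_type: str = "text/plain"


class InMemoryScraper:
    """Hypothetical scraper: acquires bytes and never interprets them."""

    def __init__(self, pages: dict[str, bytes]) -> None:
        self._pages = pages

    def fetch(self, source: str) -> SketchContent:
        # Acquisition only: look the bytes up and wrap them unchanged.
        return SketchContent(raw=self._pages[source], source=source)


class SketchParser(Generic[T]):
    """Hypothetical parser base: interprets content, performs no IO."""

    def __init__(self, content: SketchContent) -> None:
        self.content = content

    def parse(self) -> T:
        raise NotImplementedError


class WordCountParser(SketchParser[int]):
    def parse(self) -> int:
        # Interpretation only: decode the bytes and count whitespace-separated words.
        return len(self.content.raw.decode("utf-8").split())


scraper = InMemoryScraper({"memory://greeting": b"hello format agnostic world"})
content = scraper.fetch("memory://greeting")
word_count = WordCountParser(content).parse()
```

Because the parser only receives a content object, it can be tested by constructing that object directly, without any scraper or network access.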

Installation

Install OmniRead using pip:

pip install omniread

Install OmniRead using Poetry:

poetry add omniread

Quick start

HTML example:
    ```python
    from omniread import HTMLScraper, HTMLParser

    scraper = HTMLScraper()
    content = scraper.fetch("https://example.com")

    class TitleParser(HTMLParser[str]):
        def parse(self) -> str:
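            # Assumes self._soup is a BeautifulSoup tree; return "" when
            # <title> is missing or has no single text child.
            title_tag = self._soup.title
            if title_tag is None or title_tag.string is None:
                return ""
            return str(title_tag.string)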

    parser = TitleParser(content)
    title = parser.parse()
    ```

PDF example:
    ```python
    from omniread import FileSystemPDFClient, PDFScraper, PDFParser
    from pathlib import Path

    client = FileSystemPDFClient()
    scraper = PDFScraper(client=client)
    content = scraper.fetch(Path("document.pdf"))

    class TextPDFParser(PDFParser[str]):
        def parse(self) -> str:
            # implement PDF text extraction
            ...

    parser = TextPDFParser(content)
    result = parser.parse()
    ```

Public API

This module re-exports the recommended public entry points of OmniRead. Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

  • Content: Canonical content model.
  • ContentType: Supported media types.
  • HTMLScraper: HTTP-based HTML acquisition.
  • HTMLParser: Base parser for HTML DOM interpretation.
  • FileSystemPDFClient: Local filesystem PDF access.
  • PDFScraper: PDF-specific content acquisition.
  • PDFParser: Base parser for PDF binary interpretation.

Core Philosophy

OmniRead is designed as a decoupled content engine:

  1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
  2. Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
  3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.
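
One way to picture the normalized exchange and format agnosticism is a core function that never branches on format and instead looks up an interpreter by media type. This is a minimal sketch under stated assumptions: `SketchContent`, `PARSERS`, `interpret`, and the two parser functions are hypothetical names, and the registry is an illustrative technique, not a description of how OmniRead dispatches internally.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Any, Callable


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for Content: raw bytes plus a media type."""
    raw: bytes
    content_type: str


class _TitleCollector(StdlibHTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_html_title(content: SketchContent) -> str:
    collector = _TitleCollector()
    collector.feed(content.raw.decode("utf-8"))
    return collector.title.strip()


def parse_json(content: SketchContent) -> Any:
    return json.loads(content.raw)


# Hypothetical registry: format-specific logic lives only in the entries.
PARSERS: dict[str, Callable[[SketchContent], Any]] = {
    "text/html": parse_html_title,
    "application/json": parse_json,
}


def interpret(content: SketchContent) -> Any:
    # Core logic: no format-specific branches, only a lookup on the contract.
    return PARSERS[content.content_type](content)


html_result = interpret(
    SketchContent(b"<html><head><title>Example Domain</title></head></html>", "text/html")
)
json_result = interpret(SketchContent(b'{"pages": 3}', "application/json"))
```

Supporting another format in this sketch means adding one registry entry; `interpret` and the content container stay unchanged.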
