# omniread

# Summary

`OmniRead` is a format-agnostic content acquisition and parsing framework.

`OmniRead` provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.

The library is structured around three core concepts:

1.  **`Content`**: A canonical, format-agnostic container representing raw content
    bytes and minimal contextual metadata.
2.  **`Scrapers`**: Components responsible for *acquiring* raw content from a
    source (HTTP, filesystem, object storage, etc.). `Scrapers` never interpret
    content.
3.  **`Parsers`**: Components responsible for *interpreting* acquired content and
    converting it into structured, typed representations.

`OmniRead` deliberately separates these responsibilities to ensure:

-   Clear boundaries between IO and interpretation.
-   Replaceable implementations per format.
-   Predictable, testable behavior.
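The separation described above can be sketched in plain Python. This is an illustration only, not `OmniRead`'s actual implementation: `SketchContent`, `SketchScraper`, `InMemoryScraper`, `SketchParser`, and `WordCountParser` are hypothetical names, and the field names on the content container are assumptions.

```python
from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for `Content`: raw bytes plus minimal metadata."""
    raw: bytes
    source: str
    content_type: str


class SketchScraper(Protocol):
    """Acquires bytes from a source; never interprets them."""
    def fetch(self, source: str) -> SketchContent: ...


class InMemoryScraper:
    """Replaceable scraper that serves canned bytes, so no network is needed."""
    def __init__(self, pages: dict[str, bytes]) -> None:
        self._pages = pages

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(
            raw=self._pages[source], source=source, content_type="text/html"
        )


class SketchParser(Generic[T]):
    """Interprets already-acquired content; performs no IO."""
    def __init__(self, content: SketchContent) -> None:
        self.content = content

    def parse(self) -> T:
        raise NotImplementedError


class WordCountParser(SketchParser[int]):
    def parse(self) -> int:
        # Interpretation step: decode the bytes and count whitespace-separated words.
        return len(self.content.raw.decode("utf-8").split())


scraper: SketchScraper = InMemoryScraper(
    {"memory://greeting": b"hello format agnostic world"}
)
content = scraper.fetch("memory://greeting")
word_count = WordCountParser(content).parse()
```

Because the parser only sees the content container, it can be exercised with hand-built bytes, and the scraper can be swapped for an HTTP or filesystem variant without touching parsing code.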

# Installation

Install `OmniRead` using pip:

```bash
pip install omniread
```

Install `OmniRead` using Poetry:

```bash
poetry add omniread
```

---

# Quick Start

HTML example:

```python
from omniread import HTMLScraper, HTMLParser

# Acquire the raw page; the scraper does not interpret it.
scraper = HTMLScraper()
content = scraper.fetch("https://example.com")

class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        # The page may have no <title>; fall back to an empty string.
        title = self._soup.title
        return title.string if title is not None and title.string else ""

parser = TitleParser(content)
title = parser.parse()
```

PDF example:

```python
from pathlib import Path

from omniread import FileSystemPDFClient, PDFScraper, PDFParser

# The client handles filesystem access; the scraper wraps it.
client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))

class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...

parser = TextPDFParser(content)
result = parser.parse()
```

---

# Public API

The top-level `omniread` package re-exports the **recommended public entry
points** of `OmniRead`. Import from this namespace rather than from
format-specific submodules unless advanced customization is required.

- `Content`: Canonical content model.
- `ContentType`: Supported media types.
- `HTMLScraper`: HTTP-based HTML acquisition.
- `HTMLParser`: Base parser for HTML DOM interpretation.
- `FileSystemPDFClient`: Local filesystem PDF access.
- `PDFScraper`: PDF-specific content acquisition.
- `PDFParser`: Base parser for PDF binary interpretation.

---

# Core Philosophy

`OmniRead` is designed as a **decoupled content engine**:

1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither
   knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model,
   ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input
   is HTML, PDF, or JSON.
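One way to picture format agnosticism is a core function that dispatches on the content model alone. The sketch below is an assumption-laden illustration, not `OmniRead`'s real API: `SketchContentType`, `SketchContent`, `PARSERS`, and `interpret` are hypothetical names, and the real `ContentType` members may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class SketchContentType(Enum):
    """Hypothetical media types; the real `ContentType` members may differ."""
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class SketchContent:
    raw: bytes
    content_type: SketchContentType


# Format-specific interpreters live at the edge of the system.
def parse_html_length(content: SketchContent) -> str:
    return f"html:{len(content.raw)} bytes"


def parse_pdf_header(content: SketchContent) -> str:
    # PDF files start with a "%PDF-" header followed by the version.
    return "pdf:" + content.raw[:8].decode("ascii")


PARSERS: dict[SketchContentType, Callable[[SketchContent], str]] = {
    SketchContentType.HTML: parse_html_length,
    SketchContentType.PDF: parse_pdf_header,
}


def interpret(content: SketchContent) -> str:
    """Core logic: looks only at the content model, never at a concrete format."""
    return PARSERS[content.content_type](content)


html_result = interpret(SketchContent(b"<p>hi</p>", SketchContentType.HTML))
pdf_result = interpret(SketchContent(b"%PDF-1.7\n...", SketchContentType.PDF))
```

Supporting another format in this sketch means adding one enum member and one registry entry; `interpret` itself stays unchanged.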

---