# omniread/omniread/__init__.py

"""
OmniRead — format-agnostic content acquisition and parsing framework.

OmniRead provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.

The library is structured around three core concepts:

1. **Content**
   A canonical, format-agnostic container representing raw content bytes
   and minimal contextual metadata.

2. **Scrapers**
   Components responsible for *acquiring* raw content from a source
   (HTTP, filesystem, object storage, etc.). Scrapers never interpret
   content.

3. **Parsers**
   Components responsible for *interpreting* acquired content and
   converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior
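
The separation above can be sketched with simplified stand-ins. The classes
below (`SketchContent`, `InMemoryScraper`, `UpperCaseParser`) are hypothetical
illustrations, not OmniRead's own classes, and the real signatures may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchContent:
    # Hypothetical stand-in for a format-agnostic container:
    # raw bytes plus minimal contextual metadata.
    raw: bytes
    source: str
    metadata: dict = field(default_factory=dict)


class InMemoryScraper:
    # Acquires bytes only; it never inspects or decodes them.
    def __init__(self, store: dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(raw=self._store[source], source=source)


class UpperCaseParser:
    # Interprets content only; it performs no IO.
    def __init__(self, content: SketchContent) -> None:
        self._content = content

    def parse(self) -> str:
        return self._content.raw.decode("utf-8").upper()


scraper = InMemoryScraper({"memo.txt": b"hello"})
content = scraper.fetch("memo.txt")
result = UpperCaseParser(content).parse()
```

Because the parser receives only the container, it can be tested with
hand-built bytes and no network or filesystem access.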

----------------------------------------------------------------------
Installation
----------------------------------------------------------------------

Install OmniRead using pip:

    pip install omniread

Or with Poetry:

    poetry add omniread

----------------------------------------------------------------------
Basic Usage
----------------------------------------------------------------------

HTML example:

    from omniread import HTMLScraper, HTMLParser

    scraper = HTMLScraper()
    content = scraper.fetch("https://example.com")

    class TitleParser(HTMLParser[str]):
        def parse(self) -> str:
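            # Fall back to an empty string when the page has no <title>.
            title = self._soup.title
            return title.string if title and title.string else ""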

    parser = TitleParser(content)
    title = parser.parse()

PDF example:

    from pathlib import Path

    from omniread import FileSystemPDFClient, PDFScraper, PDFParser

    client = FileSystemPDFClient()
    scraper = PDFScraper(client=client)
    content = scraper.fetch(Path("document.pdf"))

    class TextPDFParser(PDFParser[str]):
        def parse(self) -> str:
            # implement PDF text extraction
            ...

    parser = TextPDFParser(content)
    result = parser.parse()

----------------------------------------------------------------------
Public API Surface
----------------------------------------------------------------------

This module re-exports the **recommended public entry points** of OmniRead.

Consumers are encouraged to import from this namespace rather than from
format-specific submodules directly, unless advanced customization is
required.

Core:
- Content
- ContentType

HTML:
- HTMLScraper
- HTMLParser

PDF:
- FileSystemPDFClient
- PDFScraper
- PDFParser

## Core Philosophy

`OmniRead` is designed as a **decoupled content engine**:

1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model, ensuring a consistent contract.
3. **Format Agnosticism**: The core logic does not depend on whether the input is HTML, PDF, or another format such as JSON.
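
One way to picture **Normalized Exchange** and **Format Agnosticism** is a
pipeline that only ever sees a content object and a parser lookup. Everything
here (`SketchType`, `SketchContent`, `PARSERS`, `run`) is a hypothetical name
chosen for illustration; OmniRead's actual `Content` and `ContentType` may
carry different fields and members.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class SketchType(Enum):
    TEXT = "text/plain"
    CSV = "text/csv"


@dataclass(frozen=True)
class SketchContent:
    raw: bytes
    content_type: SketchType


def parse_text(content: SketchContent) -> list[str]:
    # Split plain text on whitespace.
    return content.raw.decode("utf-8").split()


def parse_csv(content: SketchContent) -> list[str]:
    # Split a single CSV row on commas (no quoting support in this sketch).
    return content.raw.decode("utf-8").strip().split(",")


# Format-specific logic lives only in this registry.
PARSERS: dict[SketchType, Callable[[SketchContent], list[str]]] = {
    SketchType.TEXT: parse_text,
    SketchType.CSV: parse_csv,
}


def run(content: SketchContent) -> list[str]:
    # The core path never branches on format; it only follows the contract.
    return PARSERS[content.content_type](content)


words = run(SketchContent(b"a b c", SketchType.TEXT))
cells = run(SketchContent(b"x,y", SketchType.CSV))
```

Supporting a new format in this sketch means adding one enum member and one
registry entry; `run` stays unchanged.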

## Documentation Design

When extending `OmniRead`, follow these "AI-Native" docstring principles:

### For Humans
- **Clear Contracts**: Explicitly state what a component is and is NOT responsible for.
- **Runnable Examples**: Include small, self-contained snippets in the package `__init__.py`.

### For LLMs
- **Structured Models**: Use dataclasses and enums for core data to ensure clean MCP JSON representation.
- **Type Safety**: All public APIs must be fully typed and have corresponding `.pyi` stubs.
- **Detailed Raises**: Include `ExceptionType: description` pairs in the `Raises` section to help agents handle errors gracefully.
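
As an illustration of these principles, the sketch below pairs a dataclass and
an enum with a function whose docstring lists each exception alongside a
description. All names (`SketchKind`, `SketchRecord`, `load_record`) are
hypothetical and do not belong to OmniRead, and the Google-style `Raises`
layout is an assumption about the intended convention.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class SketchKind(str, Enum):
    # Mixing in str lets the standard json module serialize members directly.
    HTML = "html"
    PDF = "pdf"


@dataclass(frozen=True)
class SketchRecord:
    kind: SketchKind
    size: int


def load_record(kind: str, raw: bytes) -> SketchRecord:
    '''Build a record describing a raw payload.

    Raises:
        TypeError: If `raw` is not a bytes object.
        ValueError: If `kind` is not a known SketchKind value.
    '''
    if not isinstance(raw, bytes):
        raise TypeError("raw must be bytes")
    # Enum lookup by value raises ValueError for unknown kinds.
    return SketchRecord(kind=SketchKind(kind), size=len(raw))


record = load_record("pdf", b"%PDF-")
payload = json.dumps(asdict(record))
```

The dataclass converts to a plain dictionary with `asdict`, so the JSON form
needs no custom encoder in this sketch.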
"""

from .core import Content, ContentType
from .html import HTMLScraper, HTMLParser
from .pdf import FileSystemPDFClient, PDFScraper, PDFParser

__all__ = [
    # core
    "Content",
    "ContentType",

    # html
    "HTMLScraper",
    "HTMLParser",

    # pdf
    "FileSystemPDFClient",
    "PDFScraper",
    "PDFParser",
]