# omniread

# Summary

`OmniRead` is a format-agnostic content acquisition and parsing framework.

`OmniRead` provides a **cleanly layered architecture** for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

1. **`Content`**: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
2. **`Scrapers`**: Components responsible for *acquiring* raw content from a source (HTTP, filesystem, object storage, etc.). `Scrapers` never interpret content.
3. **`Parsers`**: Components responsible for *interpreting* acquired content and converting it into structured, typed representations.

`OmniRead` deliberately separates these responsibilities to ensure:

- Clear boundaries between IO and interpretation.
- Replaceable implementations per format.
- Predictable, testable behavior.

# Installation

Install `OmniRead` using pip:

```bash
pip install omniread
```

Install `OmniRead` using Poetry:

```bash
poetry add omniread
```

---

## Quick start

HTML example:

```python
from omniread import HTMLScraper, HTMLParser

# Acquire the raw page; the scraper does not interpret it.
scraper = HTMLScraper()
content = scraper.fetch("https://example.com")

class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        # Interpret the acquired HTML and return the <title> text.
        return self._soup.title.string

parser = TitleParser(content)
title = parser.parse()
```

PDF example:

```python
from pathlib import Path

from omniread import FileSystemPDFClient, PDFScraper, PDFParser

# Read the PDF bytes from the local filesystem.
client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))

class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...

parser = TextPDFParser(content)
result = parser.parse()
```

---

# Public API

This module re-exports the **recommended public entry points** of OmniRead.
Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

- `Content`: Canonical content model.
- `ContentType`: Supported media types.
- `HTMLScraper`: HTTP-based HTML acquisition.
- `HTMLParser`: Base parser for HTML DOM interpretation.
- `FileSystemPDFClient`: Local filesystem PDF access.
- `PDFScraper`: PDF-specific content acquisition.
- `PDFParser`: Base parser for PDF binary interpretation.

---

# Core Philosophy

`OmniRead` is designed as a **decoupled content engine**:

1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model, ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input is HTML, PDF, or JSON.

---
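The three points under Core Philosophy can be illustrated with a small, self-contained sketch. It does not use OmniRead itself: `SketchContent`, `InMemoryScraper`, `SketchParser`, and `TitleParser` are hypothetical stand-ins built on the standard library, and their fields and signatures are assumptions made for illustration, not the library's actual API.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser as StdHTMLParser
from typing import Dict, Generic, Mapping, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for a canonical content container."""
    raw: bytes
    source: str
    content_type: str
    metadata: Dict[str, str] = field(default_factory=dict)


class InMemoryScraper:
    """Hypothetical scraper: returns bytes for a source and never interprets them."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = documents

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(
            raw=self._documents[source],
            source=source,
            content_type="text/html",
        )


class SketchParser(Generic[T]):
    """Hypothetical parser base: sees only the content container, not the scraper."""

    def __init__(self, content: SketchContent) -> None:
        self.content = content

    def parse(self) -> T:
        raise NotImplementedError


class _TitleCollector(StdHTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


class TitleParser(SketchParser[Optional[str]]):
    def parse(self) -> Optional[str]:
        collector = _TitleCollector()
        collector.feed(self.content.raw.decode("utf-8"))
        collector.close()
        return collector.title


# The scraper and parser share nothing except the content container.
scraper = InMemoryScraper(
    {"memory://home": b"<html><head><title>Hello</title></head></html>"}
)
content = scraper.fetch("memory://home")
title = TitleParser(content).parse()
```

Because the parser depends only on the container, swapping `InMemoryScraper` for a scraper that reads from HTTP or disk would leave `TitleParser` unchanged, which is the kind of replaceability the layering is meant to provide.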