Updated docstrings and added README.md

2026-03-08 17:59:56 +05:30
parent 0fbf0ca0f0
commit de7d04eb1a
26 changed files with 546 additions and 406 deletions

README.md Normal file

@@ -0,0 +1,108 @@
# omniread
# Summary
`OmniRead` is a format-agnostic content acquisition and parsing framework.
`OmniRead` provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.
The library is structured around three core concepts:
1. **`Content`**: A canonical, format-agnostic container representing raw content
bytes and minimal contextual metadata.
2. **`Scrapers`**: Components responsible for *acquiring* raw content from a
source (HTTP, filesystem, object storage, etc.). `Scrapers` never interpret
content.
3. **`Parsers`**: Components responsible for *interpreting* acquired content and
converting it into structured, typed representations.
`OmniRead` deliberately separates these responsibilities to ensure:
- Clear boundaries between IO and interpretation.
- Replaceable implementations per format.
- Predictable, testable behavior.
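
The layering described above can be sketched in a few lines. This is a minimal illustration only: `SketchContent`, `InMemoryScraper`, `SketchParser`, and `WordCountParser` are hypothetical stand-ins invented for this example and are not `OmniRead`'s actual classes or signatures.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SketchContent:
    """Hypothetical stand-in for `Content`: raw bytes plus minimal metadata."""

    raw: bytes
    source: str
    content_type: str


class InMemoryScraper:
    """Toy scraper: serves canned bytes and never interprets them."""

    def __init__(self, pages: dict[str, bytes]) -> None:
        self._pages = pages

    def fetch(self, source: str) -> SketchContent:
        return SketchContent(
            raw=self._pages[source],
            source=source,
            content_type="text/plain",
        )


class SketchParser(Generic[T]):
    """Toy parser base: interprets acquired content and performs no IO."""

    def __init__(self, content: SketchContent) -> None:
        self.content = content

    def parse(self) -> T:
        raise NotImplementedError


class WordCountParser(SketchParser[int]):
    """Turns raw bytes into a typed result (here, a word count)."""

    def parse(self) -> int:
        return len(self.content.raw.decode("utf-8").split())


scraper = InMemoryScraper({"mem://greeting": b"hello format agnostic world"})
content = scraper.fetch("mem://greeting")
word_count = WordCountParser(content).parse()
```

Because the parser only ever sees the content container, the in-memory scraper could be replaced by an HTTP or filesystem one without touching `WordCountParser`.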
# Installation
Install `OmniRead` using pip:
```bash
pip install omniread
```
Install `OmniRead` using Poetry:
```bash
poetry add omniread
```
---
# Quick Start
HTML example:
```python
from omniread import HTMLParser, HTMLScraper

# Acquire the page; the scraper returns raw content without interpreting it.
scraper = HTMLScraper()
content = scraper.fetch("https://example.com")


class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        # Assumes the document has a <title> element.
        return self._soup.title.string


parser = TitleParser(content)
title = parser.parse()
```
PDF example:
```python
from pathlib import Path

from omniread import FileSystemPDFClient, PDFParser, PDFScraper

# The client provides local filesystem access; the scraper uses it to acquire content.
client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))


class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...


parser = TextPDFParser(content)
result = parser.parse()
```
---
# Public API
The top-level `omniread` package re-exports the **recommended public entry
points** of `OmniRead`. Import from this namespace rather than from
format-specific submodules unless advanced customization is required.
- `Content`: Canonical content model.
- `ContentType`: Supported media types.
- `HTMLScraper`: HTTP-based HTML acquisition.
- `HTMLParser`: Base parser for HTML DOM interpretation.
- `FileSystemPDFClient`: Local filesystem PDF access.
- `PDFScraper`: PDF-specific content acquisition.
- `PDFParser`: Base parser for PDF binary interpretation.
---
# Core Philosophy
`OmniRead` is designed as a **decoupled content engine**:
1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither
knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model,
ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input
is HTML, PDF, or JSON.
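
To make the normalized-exchange idea concrete, the sketch below feeds one piece of parsing logic from two unrelated acquisition paths. It is an illustration under stated assumptions: `Envelope`, `fetch_from_memory`, `fetch_from_file`, and `first_line` are hypothetical names standing in for the `Content` model, two scrapers, and a parser, and are not part of `OmniRead`.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Envelope:
    """Hypothetical stand-in for the `Content` model."""

    raw: bytes
    source: str


def fetch_from_memory(store: dict[str, bytes], key: str) -> Envelope:
    """Acquisition path 1: read bytes from an in-memory mapping."""
    return Envelope(raw=store[key], source=f"mem://{key}")


def fetch_from_file(path: Path) -> Envelope:
    """Acquisition path 2: read bytes from the local filesystem."""
    return Envelope(raw=path.read_bytes(), source=str(path))


def first_line(content: Envelope) -> str:
    """Parser-side logic: depends only on the envelope, not on its origin."""
    return content.raw.decode("utf-8").splitlines()[0]


payload = b"Title line\nbody text\n"

from_memory = fetch_from_memory({"doc": payload}, "doc")

with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "doc.txt"
    file_path.write_bytes(payload)
    from_file = fetch_from_file(file_path)

# The same interpretation step works on both, since both share one contract.
results = [first_line(from_memory), first_line(from_file)]
```

The interpretation step never inspects `source`, so adding a third acquisition path would not require any change on the parsing side.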
---