updated docs strings and added README.md
108
README.md
Normal file
@@ -0,0 +1,108 @@
# omniread

# Summary

`OmniRead` is a format-agnostic content acquisition and parsing framework.

It provides a **cleanly layered architecture** for fetching, parsing,
and normalizing content from heterogeneous sources such as HTML documents
and PDF files.

The library is structured around three core concepts:

1. **`Content`**: A canonical, format-agnostic container representing raw content
   bytes and minimal contextual metadata.
2. **`Scrapers`**: Components responsible for *acquiring* raw content from a
   source (HTTP, filesystem, object storage, etc.). `Scrapers` never interpret
   content.
3. **`Parsers`**: Components responsible for *interpreting* acquired content and
   converting it into structured, typed representations.

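The three concepts above can be sketched in plain Python. This is only an illustration of the layering, not OmniRead's actual source: the field names on `Content`, the `ContentType` members, the `Scraper` protocol, and the toy `InMemoryScraper` and `LengthParser` classes are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Generic, Protocol, TypeVar

T = TypeVar("T")


class ContentType(str, Enum):
    # Hypothetical media types; the real enum may differ.
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass(frozen=True)
class Content:
    # Raw bytes plus minimal context; field names are assumptions.
    raw: bytes
    source: str
    content_type: ContentType
    metadata: dict[str, Any] = field(default_factory=dict)


class Scraper(Protocol):
    # Acquires bytes and wraps them; never inspects them.
    def fetch(self, source: str) -> Content: ...


class Parser(ABC, Generic[T]):
    # Interprets already-acquired content into a typed result.
    def __init__(self, content: Content) -> None:
        self.content = content

    @abstractmethod
    def parse(self) -> T: ...


class InMemoryScraper:
    """Toy scraper that serves canned bytes instead of doing IO."""

    def __init__(self, pages: dict[str, bytes]) -> None:
        self._pages = pages

    def fetch(self, source: str) -> Content:
        return Content(self._pages[source], source, ContentType.HTML)


class LengthParser(Parser[int]):
    """Toy parser whose typed result is the byte length."""

    def parse(self) -> int:
        return len(self.content.raw)


content = InMemoryScraper({"mem://a": b"<p>hi</p>"}).fetch("mem://a")
size = LengthParser(content).parse()
```

In this sketch the scraper and the parser share nothing except the `Content` value, which is what lets either side be swapped without touching the other.
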
`OmniRead` deliberately separates these responsibilities to ensure:

- Clear boundaries between IO and interpretation.
- Replaceable implementations per format.
- Predictable, testable behavior.

# Installation

Install `OmniRead` using pip:

```bash
pip install omniread
```

Install `OmniRead` using Poetry:

```bash
poetry add omniread
```

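After installing, a quick way to confirm that the package is visible to the current interpreter is to look it up without importing it. The helper name `is_importable` is made up for this sketch, and it assumes the import name is `omniread`, matching the quick-start imports.

```python
from importlib.util import find_spec


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return find_spec(module_name) is not None


# "omniread" is assumed to be the import name of the installed package.
print("omniread importable:", is_importable("omniread"))
```
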
---

## Quick start

HTML example:

```python
from omniread import HTMLScraper, HTMLParser

# Acquire the raw page; the scraper does not interpret it
scraper = HTMLScraper()
content = scraper.fetch("https://example.com")


class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        # Guard against pages that have no <title> element
        title = self._soup.title
        if title is None or title.string is None:
            return ""
        return str(title.string)


parser = TitleParser(content)
title = parser.parse()
```

PDF example:

```python
from pathlib import Path

from omniread import FileSystemPDFClient, PDFScraper, PDFParser

# Acquire the raw PDF bytes from the local filesystem
client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))


class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...


parser = TextPDFParser(content)
result = parser.parse()
```

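The PDF example separates filesystem access from interpretation. As a rough idea of what the acquisition half might involve, the sketch below reads a file and checks for the `%PDF-` signature before handing the bytes on. It uses only the standard library; `read_pdf_bytes` is a hypothetical helper and is not how `FileSystemPDFClient` is known to work.

```python
import tempfile
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # signature that opens a PDF file


def read_pdf_bytes(path: Path) -> bytes:
    """Read a file and check that it starts with the PDF signature."""
    data = path.read_bytes()
    if not data.startswith(PDF_MAGIC):
        raise ValueError(f"{path} does not look like a PDF")
    return data


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.7\n%%EOF\n")
    raw = read_pdf_bytes(sample)
```

No text extraction happens here; that step belongs to a parser such as the `TextPDFParser` placeholder in the example.
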
---

# Public API

This module re-exports the **recommended public entry points** of `OmniRead`.
Consumers are encouraged to import from this namespace rather than from
format-specific submodules directly, unless advanced customization is
required.

- `Content`: Canonical content model.
- `ContentType`: Supported media types.
- `HTMLScraper`: HTTP-based HTML acquisition.
- `HTMLParser`: Base parser for HTML DOM interpretation.
- `FileSystemPDFClient`: Local filesystem PDF access.
- `PDFScraper`: PDF-specific content acquisition.
- `PDFParser`: Base parser for PDF binary interpretation.

---

# Core Philosophy

`OmniRead` is designed as a **decoupled content engine**:

1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither
   knows about the other.
2. **Normalized Exchange**: All components communicate via the `Content` model,
   ensuring a consistent contract.
3. **Format Agnosticism**: The core logic is independent of whether the input
   is HTML, PDF, or JSON.

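One way to picture these three principles together is a small registry-based pipeline. Everything in this sketch is hypothetical (the `Content` stand-in, the `fetchers` and `interpreters` registries, the `run` function, and the `mem://` scheme); it only illustrates how core routing logic can stay free of format-specific code.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Content:
    # Stand-in for the shared model; field names are assumptions.
    raw: bytes
    content_type: str


Fetcher = Callable[[str], Content]
Interpreter = Callable[[Content], object]

# Fetchers are keyed by URL scheme, interpreters by media type.
# Neither registry refers to the other.
fetchers: dict[str, Fetcher] = {}
interpreters: dict[str, Interpreter] = {}


def run(source: str) -> object:
    """Core logic: route by scheme, then by media type. No format-specific code."""
    scheme = source.split("://", 1)[0]
    content = fetchers[scheme](source)
    return interpreters[content.content_type](content)


# Format-specific pieces plug in from the outside.
fetchers["mem"] = lambda src: Content(b'{"ok": true}', "application/json")
interpreters["application/json"] = lambda c: json.loads(c.raw)

result = run("mem://example")
```

Adding another format in this sketch means registering one more fetcher or interpreter; `run` itself does not change.
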
---