updated docs strings and added README.md

2026-03-08 17:59:56 +05:30
parent 0fbf0ca0f0
commit de7d04eb1a
26 changed files with 546 additions and 406 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,108 @@
+# omniread
+
+# Summary
+
+`OmniRead` — format-agnostic content acquisition and parsing framework.
+
+`OmniRead` provides a **cleanly layered architecture** for fetching, parsing,
+and normalizing content from heterogeneous sources such as HTML documents
+and PDF files.
+
+The library is structured around three core concepts:
+
+1.  **`Content`**: A canonical, format-agnostic container representing raw content
+    bytes and minimal contextual metadata.
+2.  **`Scrapers`**: Components responsible for *acquiring* raw content from a
+    source (HTTP, filesystem, object storage, etc.). `Scrapers` never interpret
+    content.
+3.  **`Parsers`**: Components responsible for *interpreting* acquired content and
+    converting it into structured, typed representations.
+
+`OmniRead` deliberately separates these responsibilities to ensure:
+
+-   Clear boundaries between IO and interpretation.
+-   Replaceable implementations per format.
+-   Predictable, testable behavior.
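+
+The separation described above can be sketched in a few lines. This is an
+illustrative sketch only: `SketchContent`, `SketchScraper`, `InMemoryScraper`,
+`SketchParser`, and `WordCountParser` are hypothetical names chosen for this
+example and are not `OmniRead`'s actual classes or signatures.
+
+```python
+from dataclasses import dataclass, field
+from typing import Generic, Protocol, TypeVar
+
+T = TypeVar("T")
+
+
+@dataclass(frozen=True)
+class SketchContent:
+    """Hypothetical stand-in for `Content`: raw bytes plus minimal metadata."""
+    raw: bytes
+    source: str
+    metadata: dict = field(default_factory=dict)
+
+
+class SketchScraper(Protocol):
+    """Acquires bytes from a source and never interprets them."""
+    def fetch(self, source: str) -> SketchContent: ...
+
+
+class InMemoryScraper:
+    """Toy scraper that reads from a dict instead of the network."""
+    def __init__(self, store: dict[str, bytes]) -> None:
+        self._store = store
+
+    def fetch(self, source: str) -> SketchContent:
+        return SketchContent(raw=self._store[source], source=source)
+
+
+class SketchParser(Generic[T]):
+    """Interprets already-acquired content and performs no IO."""
+    def __init__(self, content: SketchContent) -> None:
+        self._content = content
+
+    def parse(self) -> T:
+        raise NotImplementedError
+
+
+class WordCountParser(SketchParser[int]):
+    """Example interpretation step: count whitespace-separated words."""
+    def parse(self) -> int:
+        return len(self._content.raw.decode("utf-8").split())
+
+
+scraper = InMemoryScraper({"memo.txt": b"fetch first, parse second"})
+content = scraper.fetch("memo.txt")
+word_count = WordCountParser(content).parse()
+```
+
+Because the parser only ever sees a content object, the toy scraper could be
+swapped for an HTTP or filesystem one without touching `WordCountParser`.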
+
+# Installation
+
+Install `OmniRead` using pip:
+
+```bash
+pip install omniread
+```
+
+Install `OmniRead` using Poetry:
+
+```bash
+poetry add omniread
+```
+
+---
+
+## Quick start
+
+HTML example:
+
+```python
+from omniread import HTMLScraper, HTMLParser
+
+scraper = HTMLScraper()
+content = scraper.fetch("https://example.com")
+
+class TitleParser(HTMLParser[str]):
+    def parse(self) -> str:
+        return self._soup.title.string
+
+parser = TitleParser(content)
+title = parser.parse()
+```
+
+PDF example:
+
+```python
+from omniread import FileSystemPDFClient, PDFScraper, PDFParser
+from pathlib import Path
+
+client = FileSystemPDFClient()
+scraper = PDFScraper(client=client)
+content = scraper.fetch(Path("document.pdf"))
+
+class TextPDFParser(PDFParser[str]):
+    def parse(self) -> str:
+        # implement PDF text extraction
+        ...
+
+parser = TextPDFParser(content)
+result = parser.parse()
+```
+
+---
+
+# Public API
+
+The top-level `omniread` package re-exports the **recommended public entry
+points** of `OmniRead`. Import from this namespace rather than from
+format-specific submodules unless you need advanced customization.
+
+- `Content`: Canonical content model.
+- `ContentType`: Supported media types.
+- `HTMLScraper`: HTTP-based HTML acquisition.
+- `HTMLParser`: Base parser for HTML DOM interpretation.
+- `FileSystemPDFClient`: Local filesystem PDF access.
+- `PDFScraper`: PDF-specific content acquisition.
+- `PDFParser`: Base parser for PDF binary interpretation.
+
+---
+
+# Core Philosophy
+
+`OmniRead` is designed as a **decoupled content engine**:
+
+1. **Separation of Concerns**: Scrapers *fetch*, Parsers *interpret*. Neither
+   knows about the other.
+2. **Normalized Exchange**: All components communicate via the `Content` model,
+   ensuring a consistent contract.
+3. **Format Agnosticism**: The core logic is independent of whether the input
+   is HTML, PDF, or JSON.
+
+---