{ "module": "omniread.html.scraper", "content": { "path": "omniread.html.scraper", "docstring": "# Summary\n\nHTML scraping implementation for OmniRead.\n\nThis module provides an HTTP-based scraper for retrieving HTML documents.\nIt implements the core `BaseScraper` contract using `httpx` as the transport\nlayer.\n\nThis scraper is responsible for:\n\n- Fetching raw HTML bytes over HTTP(S)\n- Validating response content type\n- Attaching HTTP metadata to the returned content\n\nThis scraper is not responsible for:\n\n- Parsing or interpreting HTML\n- Retrying failed requests\n- Managing crawl policies or rate limiting", "objects": { "httpx": { "name": "httpx", "kind": "alias", "path": "omniread.html.scraper.httpx", "signature": "", "docstring": null }, "Any": { "name": "Any", "kind": "alias", "path": "omniread.html.scraper.Any", "signature": "", "docstring": null }, "Mapping": { "name": "Mapping", "kind": "alias", "path": "omniread.html.scraper.Mapping", "signature": "", "docstring": null }, "Optional": { "name": "Optional", "kind": "alias", "path": "omniread.html.scraper.Optional", "signature": "", "docstring": null }, "Content": { "name": "Content", "kind": "class", "path": "omniread.html.scraper.Content", "signature": "", "docstring": "Normalized representation of extracted content.\n\nNotes:\n **Responsibilities:**\n\n - A `Content` instance represents a raw content payload along with\n minimal contextual metadata describing its origin and type.\n - This class is the primary exchange format between scrapers,\n parsers, and downstream consumers.", "members": { "raw": { "name": "raw", "kind": "attribute", "path": "omniread.html.scraper.Content.raw", "signature": "", "docstring": "Raw content bytes as retrieved from the source." }, "source": { "name": "source", "kind": "attribute", "path": "omniread.html.scraper.Content.source", "signature": "", "docstring": "Identifier of the content origin (URL, file path, or logical name)." }, "content_type": { "name": "content_type", "kind": "attribute", "path": "omniread.html.scraper.Content.content_type", "signature": "", "docstring": "Optional MIME type of the content, if known." }, "metadata": { "name": "metadata", "kind": "attribute", "path": "omniread.html.scraper.Content.metadata", "signature": "", "docstring": "Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes)." } } }, "ContentType": { "name": "ContentType", "kind": "class", "path": "omniread.html.scraper.ContentType", "signature": "", "docstring": "Supported MIME types for extracted content.\n\nNotes:\n **Guarantees:**\n\n - This enum represents the declared or inferred media type of the\n content source.\n - It is primarily used for routing content to the appropriate\n parser or downstream consumer.", "members": { "HTML": { "name": "HTML", "kind": "attribute", "path": "omniread.html.scraper.ContentType.HTML", "signature": "", "docstring": "HTML document content." }, "PDF": { "name": "PDF", "kind": "attribute", "path": "omniread.html.scraper.ContentType.PDF", "signature": "", "docstring": "PDF document content." }, "JSON": { "name": "JSON", "kind": "attribute", "path": "omniread.html.scraper.ContentType.JSON", "signature": "", "docstring": "JSON document content." }, "XML": { "name": "XML", "kind": "attribute", "path": "omniread.html.scraper.ContentType.XML", "signature": "", "docstring": "XML document content." } } }, "BaseScraper": { "name": "BaseScraper", "kind": "class", "path": "omniread.html.scraper.BaseScraper", "signature": "", "docstring": "Base interface for all scrapers.\n\nNotes:\n **Responsibilities:**\n\n - A scraper is responsible ONLY for fetching raw content (bytes)\n from a source. It must not interpret or parse it.\n - A scraper is a stateless acquisition component that retrieves raw\n content from a source and returns it as a `Content` object.\n - Scrapers define how content is obtained, not what the content means.\n - Implementations may vary in transport mechanism, authentication\n strategy, retry and backoff behavior.\n\n **Constraints:**\n\n - Implementations must not parse content, modify content semantics,\n or couple scraping logic to a specific parser.", "members": { "fetch": { "name": "fetch", "kind": "function", "path": "omniread.html.scraper.BaseScraper.fetch", "signature": "", "docstring": "Fetch raw content from the given source.\n\nArgs:\n source (str):\n Location identifier (URL, file path, S3 URI, etc.).\n\n metadata (Optional[Mapping[str, Any]], optional):\n Optional hints for the scraper (headers, auth, etc.).\n\nReturns:\n Content:\n Content object containing raw bytes and metadata.\n\nRaises:\n Exception:\n Retrieval-specific errors as defined by the implementation.\n\nNotes:\n **Responsibilities:**\n\n - Implementations must retrieve the content referenced by `source`\n and return it as raw bytes wrapped in a `Content` object." } } }, "HTMLScraper": { "name": "HTMLScraper", "kind": "class", "path": "omniread.html.scraper.HTMLScraper", "signature": "", "docstring": "Base HTML scraper using `httpx`.\n\nNotes:\n **Responsibilities:**\n\n - This scraper retrieves HTML documents over HTTP(S) and returns\n them as raw content wrapped in a `Content` object.\n - Fetches raw bytes and metadata only.\n - The scraper uses `httpx.Client` for HTTP requests, enforces an\n HTML content type, and preserves HTTP response metadata.\n\n **Constraints:**\n\n - The scraper does not: Parse HTML, perform retries or backoff,\n handle non-HTML responses.", "members": { "content_type": { "name": "content_type", "kind": "attribute", "path": "omniread.html.scraper.HTMLScraper.content_type", "signature": null, "docstring": null }, "validate_content_type": { "name": "validate_content_type", "kind": "function", "path": "omniread.html.scraper.HTMLScraper.validate_content_type", "signature": "", "docstring": "Validate that the HTTP response contains HTML content.\n\nArgs:\n response (httpx.Response):\n HTTP response returned by `httpx`.\n\nRaises:\n ValueError:\n If the `Content-Type` header is missing or does not indicate HTML content." }, "fetch": { "name": "fetch", "kind": "function", "path": "omniread.html.scraper.HTMLScraper.fetch", "signature": "", "docstring": "Fetch an HTML document from the given source.\n\nArgs:\n source (str):\n URL of the HTML document.\n metadata (Optional[Mapping[str, Any]], optional):\n Optional metadata to be merged into the returned content.\n\nReturns:\n Content:\n A `Content` instance containing raw HTML bytes, source URL, HTML content type, and HTTP response metadata.\n\nRaises:\n httpx.HTTPError:\n If the HTTP request fails.\n ValueError:\n If the response is not valid HTML." } } } } } }