Scraper

omniread.core.scraper

Summary

Abstract scraping contracts for OmniRead.

This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.

Scrapers are responsible for:

  • Locating and retrieving raw content bytes
  • Attaching minimal contextual metadata
  • Returning normalized Content objects

Scrapers are explicitly NOT responsible for:

  • Parsing or interpreting content
  • Inferring structure or semantics
  • Performing content-type specific processing

All interpretation must be delegated to parsers.

Classes

BaseScraper

Bases: ABC

Base interface for all scrapers.

Notes

Responsibilities:

- A scraper is responsible ONLY for fetching raw content (bytes)
  from a source. It must not interpret or parse it.
- A scraper is a stateless acquisition component that retrieves raw
  content from a source and returns it as a `Content` object.
- Scrapers define how content is obtained, not what the content means.
- Implementations may vary in transport mechanism, authentication
  strategy, and retry/backoff behavior.

Constraints:

- Implementations must not parse content, modify content semantics,
  or couple scraping logic to a specific parser.
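The contract above can be sketched as a minimal abstract base class. This is an illustrative sketch, not the actual OmniRead source: the real `Content` type lives elsewhere in `omniread.core`, so a stand-in dataclass with raw bytes plus metadata is assumed here.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass
class Content:
    """Stand-in for OmniRead's Content object: raw bytes plus minimal metadata."""
    data: bytes
    metadata: Mapping[str, Any] = field(default_factory=dict)


class BaseScraper(ABC):
    """Stateless acquisition component: fetches raw bytes, never parses them."""

    @abstractmethod
    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        """Retrieve the content referenced by `source` as a Content object."""
        ...
```

Because `fetch` is abstract, `BaseScraper` itself cannot be instantiated; each concrete scraper supplies its own transport while the return type stays uniform.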
Functions
fetch abstractmethod
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch raw content from the given source.

Parameters:

  • source (str): Location identifier (URL, file path, S3 URI, etc.). Required.
  • metadata (Optional[Mapping[str, Any]]): Optional hints for the scraper (headers, auth, etc.). Default: None.

Returns:

  • Content (Content): Content object containing raw bytes and metadata.

Raises:

  • Exception: Retrieval-specific errors as defined by the implementation.

Notes

Responsibilities:

- Implementations must retrieve the content referenced by `source`
  and return it as raw bytes wrapped in a `Content` object.