
How Built-in Providers Work

Overview

vertex-forager is built for users who consume the built-in providers shipped with the package. Provider-specific integrations such as yfinance and Sharadar can look very different internally, but they all flow through one shared collection model.

The goal of the provider architecture is simple:

  • keep the user-facing client surface consistent
  • isolate vendor-specific request and parsing logic
  • normalize all provider outputs into one shared pipeline
  • make it possible to add more built-in providers over time without rewriting the pipeline

Current built-in providers

The package currently ships with these built-in providers:

  • yfinance: library-backed provider for market data and related datasets exposed through the yfinance Python library
  • sharadar: HTTP-backed provider for Nasdaq Data Link / Sharadar datasets

These two providers are useful examples of why the shared architecture exists: one relies on an internal library transport path, while the other uses normal HTTP requests, but both still enter the same normalization, persistence, metrics, result, and troubleshooting flow.

Current built-in provider model

  • Built-in providers share the same client-facing collection workflow
  • Users do not need to implement custom providers to use the package effectively
  • Provider internals are documented here to explain how built-in integrations work, not to present provider authoring as a primary workflow

This design also leaves room for future built-in providers to be added over time while preserving the same persistence, metrics, result, and troubleshooting model.

Built for future built-in providers

vertex-forager is intentionally structured so more built-in providers can be added over time without forcing users to learn a new collection model for each vendor.

That means future built-in providers should still be able to share:

  • the same create_client(...) entrypoint pattern
  • the same request → normalize → collect/write pipeline
  • the same result, metrics, persistence, and troubleshooting surface

This is an internal architecture goal for the package itself, not a statement that normal users are expected to build custom providers as part of the primary workflow.
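As a rough illustration, the shared create_client(...) entrypoint pattern can be pictured as a small provider registry. This is a hypothetical sketch: the registry contents and the shape of the returned client objects are assumptions for illustration, not the package's real implementation.

```python
# Hypothetical sketch of the shared create_client(...) entrypoint pattern.
# The registry contents and returned objects are illustrative assumptions.
def create_client(provider: str, **config):
    """Single entrypoint: every built-in provider is created the same way."""
    registry = {
        "yfinance": lambda cfg: {"provider": "yfinance", "config": cfg},
        "sharadar": lambda cfg: {"provider": "sharadar", "config": cfg},
    }
    try:
        factory = registry[provider]
    except KeyError:
        raise ValueError(f"unknown built-in provider: {provider!r}") from None
    return factory(config)
```

The point of the pattern is that adding a future built-in provider means adding one registry entry, not a new user-facing workflow.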

What users should expect

From a user point of view, the most important consequence of this design is consistency. Built-in providers may fetch data differently internally, but they should still feel like one package when you use them.

That means users should expect built-in providers to share:

  • the same client creation pattern
  • the same request style for normal collection methods
  • the same in-memory vs persistence split
  • the same metrics, quality-check, and troubleshooting vocabulary
  • the same general result and write behavior once data enters the pipeline

Why providers are split this way

  • Client: exposes the user-facing collection methods and configuration surface
  • Router: turns a dataset request into fetch jobs and parses raw payloads into normalized frames
  • Pipeline: coordinates fetch, parse, normalize, quality checks, buffering, and write/collect behavior

This split keeps vendor-specific logic out of the shared pipeline while preserving a consistent API for built-in providers.
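The three-layer split can be sketched in miniature. The generate_jobs and parse names come from the BaseRouter contract; every other class name, signature, and body here is an illustrative assumption, not the package's actual code.

```python
# Minimal sketch of the client/router/pipeline split. Only generate_jobs
# and parse reflect the documented router contract; the rest is illustrative.
from typing import Iterable, Protocol


class Router(Protocol):
    def generate_jobs(self, dataset: str, symbols: list[str]) -> Iterable[dict]: ...
    def parse(self, job: dict, payload: bytes) -> dict: ...


def run_pipeline(router: Router, dataset: str, symbols: list[str], fetch) -> list[dict]:
    """Provider-agnostic stage: fetch each job, hand raw payloads back to
    the router for parsing, and collect the normalized frames."""
    return [router.parse(job, fetch(job))
            for job in router.generate_jobs(dataset, symbols)]


class Client:
    """User-facing surface; vendor-specific logic lives entirely in the router."""
    def __init__(self, router: Router, fetch):
        self._router, self._fetch = router, fetch

    def collect(self, dataset: str, symbols: list[str]) -> list[dict]:
        return run_pipeline(self._router, dataset, symbols, self._fetch)
```

Because the pipeline only ever talks to the router through generate_jobs and parse, swapping Sharadar for yfinance changes the router object, not the collection loop.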

HTTP-backed vs library-backed providers

  • HTTP-backed providers such as Sharadar build ordinary HTTP or HTTPS request specs
  • Library-backed providers such as yfinance route calls through an internal library transport layer
  • The transport choice is internal; users still work through the same client methods
  • Once payloads are normalized into shared frame/table structures, downstream pipeline stages remain provider-agnostic

Sharadar can optionally use caller-provided ticker metadata for smart batching. When a method receives meta=..., the client reads the sharadar_tickers table from that DuckDB database and passes ticker, firstpricedate, and lastpricedate into the router. When meta=None, the router falls back to heuristic batching, and correctness still comes from cursor-based pagination.
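A sketch of that metadata path, under loud assumptions: the sharadar_tickers table and its three columns come from the text above, but the query, the batching heuristic, and the helper names are invented for illustration, and an in-memory sqlite3 database stands in for the real DuckDB file so the sketch stays dependency-free.

```python
# Illustrative sketch of meta-driven batching. The real client reads a
# sharadar_tickers table from DuckDB; sqlite3 stands in here, and both
# helper functions are hypothetical.
import sqlite3


def load_ticker_meta(conn) -> list[tuple]:
    """Read ticker metadata the way the client hands it to the router."""
    return conn.execute(
        "SELECT ticker, firstpricedate, lastpricedate FROM sharadar_tickers"
    ).fetchall()


def batch_by_date_range(rows: list[tuple], batch_size: int = 2) -> list[list[tuple]]:
    """Group tickers with similar price-date ranges so each batch can be
    covered by one paginated request (heuristic batching is the fallback)."""
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

With meta=None none of this runs; the router batches heuristically and relies on cursor-based pagination for correctness.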

For a library-backed provider such as yfinance, the flow is:

  1. a client method requests a dataset
  2. the router emits a request spec with an internal scheme such as yfinance://...
  3. the shared executor routes that spec to the registered library fetcher
  4. the returned Python object is serialized across the fetch boundary
  5. the router decodes and normalizes it into shared frames and tables

After that point, both library-backed and HTTP-backed providers follow the same downstream path for normalization, buffering, metrics, and write or collect behavior.
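The five steps above can be sketched as a scheme-based dispatch. The yfinance:// scheme name appears in the text; the fetcher registry, function names, and the use of pickle for the serialization boundary are assumptions made for this sketch.

```python
# Hedged sketch of the library-transport dispatch (steps 1-5 above).
# The registry, function names, and pickle serialization are assumptions.
import pickle
from urllib.parse import urlparse

FETCHERS = {}


def register_fetcher(scheme: str):
    """Register a library fetcher for an internal URL scheme."""
    def deco(fn):
        FETCHERS[scheme] = fn
        return fn
    return deco


@register_fetcher("yfinance")
def fetch_yfinance(spec_url: str) -> object:
    # The real provider would call into the yfinance library here;
    # this stub just returns a placeholder payload.
    return {"url": spec_url, "rows": []}


def execute(spec_url: str) -> bytes:
    """Route an internal-scheme spec to its fetcher (step 3) and serialize
    the returned Python object across the fetch boundary (step 4)."""
    scheme = urlparse(spec_url).scheme
    return pickle.dumps(FETCHERS[scheme](spec_url))


def decode(payload: bytes) -> object:
    """Router side: decode before normalizing into frames (step 5)."""
    return pickle.loads(payload)
```

Once decode returns a Python object, the router normalizes it into shared frames and the provider-agnostic pipeline takes over.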

Router responsibilities

Routers are responsible for:

  • building fetch jobs for a dataset and symbol set
  • defining provider-specific request parameters and pagination behavior
  • parsing raw responses into FramePacket objects
  • mapping provider-specific errors into shared project exceptions
  • normalizing dataset and table naming so writes and in-memory collection stay consistent
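A toy router covering those responsibilities might look like the following. FetchError stands in for the project's shared exception, and the payload shape, page-size policy, and naming rule are illustrative assumptions.

```python
# Illustrative router exercising the responsibilities listed above.
# FetchError, the payload shape, and the naming rule are stand-ins.
class FetchError(Exception):
    """Shared project exception that provider-specific errors map into."""


class DemoRouter:
    PAGE_SIZE = 100  # provider-specific pagination policy

    def generate_jobs(self, dataset: str, symbols: list[str]):
        # one fetch job per symbol, carrying provider-specific request params
        for symbol in symbols:
            yield {"dataset": dataset, "symbol": symbol,
                   "params": {"per_page": self.PAGE_SIZE, "cursor": None}}

    def parse(self, job: dict, payload) -> dict:
        try:
            rows = payload["rows"]            # vendor-specific payload shape
        except (KeyError, TypeError) as err:  # map vendor error -> shared type
            raise FetchError(f"bad payload for {job['symbol']}") from err
        # normalize table naming so writes and in-memory collection agree
        return {"table": job["dataset"].lower(), "rows": rows}
```

Everything vendor-specific (request params, payload shape, error taxonomy) stays inside this class, which is exactly what keeps the downstream pipeline provider-agnostic.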

Shared pipeline surface

The provider split is an implementation detail that supports a consistent built-in user experience. This architecture is why built-in providers can share:

  • one client creation flow
  • one persistence model
  • one metrics and result surface
  • one set of how-to and troubleshooting guides

Internal safety constraints

  • provider-specific logic stays out of the shared HTTP executor
  • private or dunder attribute access is blocked in the library-backed transport path
  • request specs must describe an allowed call shape
  • secrets should not appear in logs
  • invalid transport specs surface as typed failures instead of silent fallbacks
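The dunder-blocking constraint can be illustrated with a small validator. The function name and the dotted-path call shape are assumptions; the real transport layer's validation may differ, but the rule it enforces is the one stated above.

```python
# Illustrative check for the library-backed transport path, assuming call
# specs carry a dotted attribute path such as "Ticker.history". The helper
# name and path format are hypothetical.
def validate_call_path(path: str) -> str:
    """Reject private/dunder attribute access in a library call path,
    raising a typed failure instead of silently falling back."""
    for part in path.split("."):
        if not part.isidentifier():
            raise ValueError(f"invalid attribute name: {part!r}")
        if part.startswith("_"):
            raise ValueError(f"private attribute access blocked: {part!r}")
    return path
```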

Internal modules

  • routers/base.py: BaseRouter contract and responsibilities (generate_jobs, parse)
  • routers/transforms.py: common helpers for metadata injection, date parsing, empties, and column normalization
  • routers/jobs.py: standard job builders and pagination context helpers
  • routers/errors.py: provider-to-standard exception mapping
  • clients/dispatcher.py: pipeline orchestration function that isolates Client from pipeline specifics
  • clients/validation.py: reserved-key filtering for pipeline kwargs and shared client-side checks

Reserved Pipeline Kwargs

  • router
  • dataset
  • symbols
  • writer
  • mapper
  • on_progress
  • progress
  • resume

These keys are explicitly passed into pipeline.run and should be removed from kwargs forwarding using clients/validation.filter_reserved_kwargs.

  • on_progress now receives a single ProgressSnapshot model per completed job.
  • progress=True enables the built-in tqdm.auto display and prints a post-run summary.
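The reserved-key filtering can be sketched directly from the list above. The reserved set is taken from this section; the function body is a plausible sketch, not necessarily identical to clients/validation.filter_reserved_kwargs.

```python
# Sketch of reserved-key filtering. The key set comes from the docs above;
# the implementation is illustrative.
RESERVED = {"router", "dataset", "symbols", "writer", "mapper",
            "on_progress", "progress", "resume"}


def filter_reserved_kwargs(kwargs: dict) -> dict:
    """Drop keys that are passed explicitly into pipeline.run so they are
    never forwarded twice."""
    return {k: v for k, v in kwargs.items() if k not in RESERVED}
```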

Testing

  • Unit tests cover transforms, jobs, errors, and validation modules to ensure stable contracts
  • Full regression coverage confirms behavior parity across built-in providers

Examples

  • Error mapping
  • Quandl-style: raise_quandl_error(provider="sharadar", err=err) → FetchError
  • yfinance parse: raise_yfinance_parse_error(exc=exc, dataset=dataset) → TransformError; preserves UnpicklingError; maps HTTPError to FetchError
  • Job policy
  • Single ticker: single_symbol_job(provider=provider, dataset=dataset, symbol=symbol, url=url, params=params, auth=None, context={"symbol": symbol})
  • Pagination: pagination_job(provider=provider, dataset=dataset, url=url, params=params, auth=auth, context=make_pagination_context(meta_key=meta_key, cursor_param=cursor_param, max_pages=max_pages))

Pagination follow-up jobs do not bypass pipeline fairness:

  • The initial request enters the main request queue.
  • Follow-up page jobs enter the pagination scheduler lane.
  • That lane always uses Deficit Round Robin per symbol.
  • schedule.quantum controls how many pagination credits each symbol receives per DRR round.
  • schedule.max_pending_per_symbol optionally blocks one symbol from building an unbounded local queue.
  • schedule.backpressure_threshold optionally blocks the producer when total pending DRR work grows too large.
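A toy version of that per-symbol Deficit Round Robin lane, assuming one FIFO queue per symbol: the quantum parameter mirrors schedule.quantum above, but the real scheduler (pending caps, backpressure, async workers) is more involved than this sketch.

```python
# Toy per-symbol Deficit Round Robin, illustrating the pagination lane.
# One FIFO queue per symbol is an assumption of this sketch.
from collections import deque


def drr_round(queues: dict[str, deque], deficits: dict[str, int], quantum: int) -> list:
    """Run one DRR round: each symbol earns `quantum` credits, then
    dequeues pending jobs while it has both credit and work."""
    served = []
    for symbol, q in queues.items():
        deficits[symbol] = deficits.get(symbol, 0) + quantum
        while q and deficits[symbol] > 0:
            served.append(q.popleft())
            deficits[symbol] -= 1
        if not q:
            deficits[symbol] = 0  # idle symbols do not bank credit
    return served
```

A symbol with many follow-up pages therefore drains at most `quantum` jobs per round, so it cannot starve symbols with shorter pagination chains.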

Diagrams

Flowchart — Router → Pipeline

flowchart TD
    A[Client call] --> B[Router.generate_jobs]
    B --> C[FetchJob]
    C --> D[Pipeline Producer]
    D --> E[Fetch Worker]
    E --> F[HttpExecutor.fetch]
    F --> G[Router.parse]
    G --> H[ParseResult]
    H --> I[FramePacket]
    H --> J{next_jobs?}
    J -->|yes| D
    J -->|no| K[Writer persist/collect]
    I --> K
    K --> L[Result]

Sequence — URL Build & Delivery

sequenceDiagram
    participant Client
    participant Router as ProviderRouter
    participant Builder as JobBuilder
    participant Auth as Auth
    participant Pipeline
    Client->>Router: generate_jobs(dataset, symbols)
    Router->>Router: resolve dataset URL
    loop For each symbol/batch
        Router->>Builder: build per-symbol job
        Builder->>Auth: inject api_key
        Builder->>Builder: RequestSpec(url, params, auth)
        Builder-->>Router: FetchJob
        Router-->>Pipeline: yield job
    end
    Pipeline->>Pipeline: execute job.spec