Schema

vertex-forager uses TableSchema objects to describe the canonical shape of each output table. Schemas define the target columns, types, unique keys, and optional data-quality rules that the mapper and writers rely on during normalization and persistence.

Use the schema reference when you want to:

  • understand the columns and keys expected for a provider table
  • configure strict validation with SchemaMapper
  • inspect registry lookup behavior before writing custom integrations

Usage example

from vertex_forager.schema.registry import get_table_schema

schema = get_table_schema("yfinance_price")
if schema is not None:
    print(schema.unique_key)
    print(schema.analysis_date_col)

The registry function looks up a table name in the shared provider registry and returns the matching TableSchema if one exists. Providers such as Sharadar and yfinance register their schemas centrally so downstream normalization and writers can stay provider-agnostic.
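The lookup behavior described above can be pictured with a small stand-in registry. This is a minimal sketch, not vertex-forager's implementation: `SketchSchema`, `register_schema`, `lookup_schema`, and the `("ticker", "date")` keys are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical stand-in for TableSchema; only the fields needed here.
@dataclass(frozen=True)
class SketchSchema:
    table: str
    unique_key: tuple[str, ...]


# Shared registry keyed by canonical table name (illustrative only).
_REGISTRY: dict[str, SketchSchema] = {}


def register_schema(schema: SketchSchema) -> None:
    """Add a schema to the shared registry under its table name."""
    _REGISTRY[schema.table] = schema


def lookup_schema(table: str) -> SketchSchema | None:
    """Return the registered schema, or None when the table is unknown."""
    return _REGISTRY.get(table)


# Each provider registers its tables in one central place.
# The key columns below are assumed, not taken from the library.
register_schema(SketchSchema("yfinance_price", ("ticker", "date")))
register_schema(SketchSchema("sharadar_sep", ("ticker", "date")))

found = lookup_schema("yfinance_price")
missing = lookup_schema("unknown_table")
```

Because an unknown table yields `None` rather than an exception, callers can decide for themselves whether a missing schema is an error or a signal to pass data through untouched.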

TableSchema

Definition of a table's structural constraints.

Attributes:

  • table (str): Canonical table name (e.g., "sharadar_sep").
  • schema (dict[str, DataType | type[DataType]]): Mapping of column names to Polars DataTypes.
  • unique_key (tuple[str, ...]): Tuple of column names that form the primary key (for deduplication).
  • analysis_date_col (str | None): Optional timestamp/analysis column used for time-based processing. Defaults to None.
  • flexible_schema (bool): Whether the schema is permissive to extra/unknown fields. Defaults to False.
  • quality_rules (tuple[DataQualityRule, ...]): Tuple of data quality validation rules to apply to table data. Defaults to an empty tuple.
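To make the attribute defaults and the role of unique_key concrete, here is a rough sketch using a plain dataclass. `TableSchemaSketch` and `dedupe` are hypothetical; the real class maps columns to Polars DataTypes rather than Python types, and the keep-last policy shown here is an assumption, since the reference only says the key is used for deduplication.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableSchemaSketch:
    """Hypothetical mirror of the documented attributes and defaults."""

    table: str
    schema: dict[str, type]  # the real class uses Polars DataTypes here
    unique_key: tuple[str, ...]
    analysis_date_col: str | None = None
    flexible_schema: bool = False
    quality_rules: tuple = ()


def dedupe(rows: list[dict], schema: TableSchemaSketch) -> list[dict]:
    """Keep the last row seen for each unique-key value (assumed policy)."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[col] for col in schema.unique_key)
        latest[key] = row
    return list(latest.values())


price = TableSchemaSketch(
    table="example_price",  # illustrative table name
    schema={"ticker": str, "date": str, "close": float},
    unique_key=("ticker", "date"),
)

rows = [
    {"ticker": "AAA", "date": "2024-01-02", "close": 10.0},
    {"ticker": "AAA", "date": "2024-01-02", "close": 10.5},  # same key
    {"ticker": "BBB", "date": "2024-01-02", "close": 20.0},
]
deduped = dedupe(rows, price)
```

Only the three required fields are passed; the optional attributes fall back to the defaults listed above.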

SchemaMapper

Core component responsible for data normalization and schema enforcement.

The SchemaMapper ensures that all data flowing through the pipeline conforms to pre-defined schemas before it reaches the Writer stage. This guarantees type safety and structural consistency across different storage backends.

Key Responsibilities:

  1. Schema Lookup: Retrieves the authoritative TableSchema for a given table name from the central registry.
  2. Type Casting:
       • Default (strict_validation=False): Casts columns with non-strict casting (strict=False), allowing nulls on failure.
       • Strict (strict_validation=True): Casts with strict=True and raises on type mismatches.
  3. Missing Column Handling: Automatically adds missing schema columns with null values to ensure downstream systems receive complete records.
  4. Column Ordering: Reorders columns to match the canonical schema definition.
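The casting, filling, and reordering steps can be sketched in plain Python over column lists. This is an illustration of the idea only: `cast_value` and `normalize_columns` are hypothetical helpers, Python types stand in for Polars DataTypes, and dropping columns that are not in the schema is an assumption the reference does not confirm.

```python
from __future__ import annotations


def cast_value(value, target, strict):
    """Cast one value; non-strict mode yields None on failure."""
    if value is None:
        return None
    try:
        return target(value)
    except (TypeError, ValueError):
        if strict:
            raise
        return None


def normalize_columns(columns, schema, strict_validation=False):
    """Cast, fill, and reorder columns so they follow the schema order."""
    n_rows = len(next(iter(columns.values()), []))
    out = {}
    for name, target in schema.items():  # schema order becomes output order
        if name in columns:
            out[name] = [
                cast_value(v, target, strict_validation) for v in columns[name]
            ]
        else:
            out[name] = [None] * n_rows  # missing column is filled with nulls
    # Assumption: columns outside the schema are dropped in this sketch.
    return out


schema = {"ticker": str, "date": str, "close": float}
raw = {"close": ["10.5", "n/a"], "ticker": ["AAA", "BBB"]}

lenient = normalize_columns(raw, schema)

try:
    normalize_columns(raw, schema, strict_validation=True)
    strict_failed = False
except ValueError:
    strict_failed = True
```

In the lenient call the unparseable "n/a" becomes a null and the absent date column is added as nulls, while the strict call raises on the same input.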

Usage

mapper = SchemaMapper()
normalized_packet = mapper.normalize(raw_packet)

normalize(packet)

Enforce schema conformance on a data packet.

This method transforms a raw DataFrame into a schema-compliant DataFrame. If a schema is registered, the frame is cast to declared types and columns are reordered. When analysis_date_col is set on the schema and present in frame.columns, that column is cast to the schema type (strict=False) and the frame is sorted by it. No new column is created if the target analysis_date_col is absent.

If no schema is registered for the table, the packet is returned unchanged.
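The two branches described above (pass through when no schema exists, otherwise cast and sort by the analysis date column when it is present) might look roughly like this in plain Python. `parse_date`, `normalize_rows`, and the dict-based registry are hypothetical; the sketch covers only the date handling, not the full column casting, and placing nulls first in the sort is an assumption.

```python
from __future__ import annotations

from datetime import date


def parse_date(value):
    """Non-strict cast: return None instead of raising on bad input."""
    if isinstance(value, date):
        return value
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def normalize_rows(table, rows, registry):
    """Pass unknown tables through; otherwise sort by the date column."""
    spec = registry.get(table)
    if spec is None:
        return rows  # no schema registered: returned unchanged
    col = spec.get("analysis_date_col")
    if col is None or not rows or col not in rows[0]:
        return list(rows)  # column absent: no new column is created
    cast = [{**r, col: parse_date(r[col])} for r in rows]
    # Assumption: nulls sort first; the real ordering is not documented here.
    return sorted(cast, key=lambda r: (r[col] is not None, r[col] or date.min))


registry = {"example_price": {"analysis_date_col": "date"}}
rows = [
    {"date": "2024-03-01", "close": 3.0},
    {"date": "2024-01-15", "close": 1.0},
    {"date": "bad", "close": 2.0},
]

sorted_rows = normalize_rows("example_price", rows, registry)
untouched = normalize_rows("unknown_table", rows, registry)
```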

Parameters:

  • packet (FramePacket, required): Input packet containing potentially raw/untyped data.

Returns:

  • FramePacket: A new packet containing the normalized DataFrame.

Registry

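get_table_schema retrieves the schema for a given table name. As in the usage example above, check the result against None before using it, since a table may have no registered schema.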