Schema
vertex-forager uses TableSchema objects to describe the canonical shape of each output table. Schemas define the target columns, types, unique keys, and optional data-quality rules that the mapper and writers rely on during normalization and persistence.
Use the schema reference when you want to:
- understand the columns and keys expected for a provider table
- configure strict validation with SchemaMapper
- inspect registry lookup behavior before writing custom integrations
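Usage example
from vertex_forager.schema.registry import get_table_schema
# Returns None when no schema is registered under this name.
schema = get_table_schema("yfinance_price")
if schema is not None:
    print(schema.unique_key)  # columns forming the deduplication key
    print(schema.analysis_date_col)  # time column, or None if unset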
The registry function looks up a table name in the shared provider registry and returns the matching TableSchema if one exists. Providers such as Sharadar and yfinance register their schemas centrally so downstream normalization and writers can stay provider-agnostic.
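The lookup contract described above (return the schema if one is registered, otherwise None) can be modeled with a dictionary-backed registry. This is a minimal sketch, not vertex-forager's implementation: `DemoSchema`, `_REGISTRY`, `register_schema`, and `lookup_schema` are hypothetical names, and the table name and key columns are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DemoSchema:
    """Hypothetical stand-in for TableSchema with only two fields."""

    table: str
    unique_key: tuple[str, ...]


# Hypothetical module-level registry shared by all providers.
_REGISTRY: dict[str, DemoSchema] = {}


def register_schema(schema: DemoSchema) -> None:
    """Store a schema under its canonical table name."""
    _REGISTRY[schema.table] = schema


def lookup_schema(table: str) -> DemoSchema | None:
    """Return the registered schema, or None when the table is unknown."""
    return _REGISTRY.get(table)


# Illustrative values only; real providers register their own tables.
register_schema(DemoSchema(table="demo_price", unique_key=("ticker", "date")))

found = lookup_schema("demo_price")
missing = lookup_schema("not_registered")
```

Because an unknown table yields None rather than an exception, callers are expected to check the result before reading attributes, as the usage example does.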
TableSchema
Definition of a table's structural constraints.
Attributes:
| Name | Type | Description |
|---|---|---|
| table | str | Canonical table name (e.g., "sharadar_sep"). |
| schema | dict[str, DataType \| type[DataType]] | Mapping of column names to Polars DataTypes. |
| unique_key | tuple[str, ...] | Tuple of column names that form the primary key (for deduplication). |
| analysis_date_col | str \| None | Optional timestamp/analysis column used for time-based processing. Defaults to None. |
| flexible_schema | bool | Whether schema is permissive to extra/unknown fields. Defaults to False. |
| quality_rules | tuple[DataQualityRule, ...] | Tuple of data quality validation rules to apply to table data. Defaults to empty tuple. |
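To make the attribute table concrete, the sketch below mirrors the documented fields and defaults in a frozen dataclass and uses unique_key to deduplicate rows. It is a model under stated assumptions, not the library's class: `SketchTableSchema` and `dedupe_rows` are hypothetical, column types are plain strings instead of Polars DataTypes, quality rules are left as an opaque tuple, and keeping the last row per key is a choice made here rather than documented behavior.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SketchTableSchema:
    """Hypothetical mirror of the documented TableSchema attributes."""

    table: str
    schema: dict[str, str]  # real schemas map names to Polars DataTypes
    unique_key: tuple[str, ...]
    analysis_date_col: str | None = None
    flexible_schema: bool = False
    quality_rules: tuple[Any, ...] = ()


def dedupe_rows(
    rows: list[dict[str, Any]], schema: SketchTableSchema
) -> list[dict[str, Any]]:
    """Keep one row per unique_key value; later rows replace earlier ones."""
    latest: dict[tuple[Any, ...], dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[col] for col in schema.unique_key)
        latest[key] = row
    return list(latest.values())


demo = SketchTableSchema(
    table="demo_price",
    schema={"ticker": "Utf8", "date": "Date", "close": "Float64"},
    unique_key=("ticker", "date"),
    analysis_date_col="date",
)

rows = [
    {"ticker": "AAA", "date": "2024-01-02", "close": 10.0},
    {"ticker": "AAA", "date": "2024-01-02", "close": 10.5},  # same key as above
    {"ticker": "BBB", "date": "2024-01-02", "close": 20.0},
]
deduped = dedupe_rows(rows, demo)
```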
SchemaMapper
Core component responsible for data normalization and schema enforcement.
The SchemaMapper ensures that all data flowing through the pipeline conforms to pre-defined schemas before it reaches the Writer stage. This guarantees type safety and structural consistency across different storage backends.
Key Responsibilities:
1. Schema Lookup: Retrieves the authoritative TableSchema for a given table name from the central registry.
2. Type Casting:
   - Default (strict_validation=False): Casts columns non-strictly (strict=False), so values that cannot be cast become null.
   - Strict (strict_validation=True): Casts with strict=True and raises on type mismatches.
3. Missing Column Handling: Automatically adds missing schema columns with null values to ensure downstream systems receive complete records.
4. Column Ordering: Reorders columns to match the canonical schema definition.
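The casting, null-fill, and ordering steps listed above can be illustrated on a single record. The real mapper operates on Polars DataFrames; this dependency-free sketch models the same idea on a plain dict. `DEMO_SCHEMA`, `cast_value`, and `normalize_row` are hypothetical names, the converters stand in for Polars dtypes, and the handling of extra columns is not modeled.

```python
from __future__ import annotations

from typing import Any, Callable

# Hypothetical schema: column name -> converter, in canonical order.
DEMO_SCHEMA: dict[str, Callable[[Any], Any]] = {
    "ticker": str,
    "volume": int,
    "close": float,
}


def cast_value(value: Any, convert: Callable[[Any], Any], strict: bool) -> Any:
    """Convert one value; on failure raise (strict) or return None."""
    if value is None:
        return None
    try:
        return convert(value)
    except (TypeError, ValueError):
        if strict:
            raise
        return None


def normalize_row(
    row: dict[str, Any], strict_validation: bool = False
) -> dict[str, Any]:
    """Cast values, null-fill missing columns, and order keys like the schema."""
    # row.get() yields None for absent columns, which covers the null fill;
    # iterating the schema dict produces the canonical column order.
    return {
        name: cast_value(row.get(name), convert, strict_validation)
        for name, convert in DEMO_SCHEMA.items()
    }


raw = {"close": "10.5", "volume": "n/a"}  # no ticker, unparseable volume
lenient = normalize_row(raw)

try:
    normalize_row(raw, strict_validation=True)
    strict_failed = False
except ValueError:
    strict_failed = True
```

In the lenient call the bad `volume` value and the missing `ticker` both end up as None, while the strict call raises on the same input. That is the trade-off the two modes expose: silent nulls versus an early failure.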
Usage
mapper = SchemaMapper()
normalized_packet = mapper.normalize(raw_packet)
normalize(packet)
Enforce schema conformance on a data packet.
This method transforms a raw DataFrame into a schema-compliant DataFrame.
If a schema is registered, the frame is cast to declared types and columns
are reordered. When analysis_date_col is set on the schema and present
in frame.columns, that column is cast to the schema type (strict=False)
and the frame is sorted by it. No new column is created if the target
analysis_date_col is absent.
If no schema is registered for the table, the packet is returned unchanged.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| packet | FramePacket | Input packet containing potentially raw/untyped data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FramePacket | FramePacket | A new packet containing the normalized DataFrame. |
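Two branches of normalize are easy to miss: the passthrough for unregistered tables, and the fact that the analysis date column is only cast and sorted when it already exists. The sketch below models just those branches on lists of dicts. It is an illustration under assumptions, not the library's code: `ANALYSIS_DATE_COLS`, `parse_date`, and `normalize_rows` are hypothetical, the casting of other columns is omitted, and placing unparseable dates first in the sort order is a choice made here.

```python
from __future__ import annotations

from datetime import date
from typing import Any

# Hypothetical registry: table name -> analysis date column (or None).
ANALYSIS_DATE_COLS: dict[str, str | None] = {"demo_price": "date"}


def parse_date(value: Any) -> date | None:
    """Non-strict cast: unparseable values become None instead of raising."""
    if isinstance(value, date):
        return value
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def normalize_rows(table: str, rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Pass unknown tables through; otherwise cast and sort by the date column."""
    if table not in ANALYSIS_DATE_COLS:
        return rows  # no schema registered: returned unchanged
    date_col = ANALYSIS_DATE_COLS[table]
    if date_col is None or not all(date_col in row for row in rows):
        return rows  # column absent: nothing is created or sorted
    cast = [{**row, date_col: parse_date(row[date_col])} for row in rows]
    # Rows whose date failed to parse sort first in this sketch (an assumption).
    return sorted(
        cast, key=lambda r: (r[date_col] is not None, r[date_col] or date.min)
    )


rows = [
    {"ticker": "AAA", "date": "2024-03-01"},
    {"ticker": "AAA", "date": "2024-01-15"},
]
sorted_rows = normalize_rows("demo_price", rows)
untouched = normalize_rows("unknown_table", rows)
```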