Data Vault 2.1 — Glossary
Definitions for terms used in strategy.md and patterns.md. Where DV 2.1 changed or clarified a 2.0 definition, the change is called out.
Layers
Stage — Source-shaped landing tables with audit and hash columns added. Not part of the vault proper; thrown away on reload.
Raw Vault — Insert-only, schema-on-write integration layer. Hubs, links, satellites, references. No business rules.
Business Vault — Insert-only derivation layer. Same construct types as raw, plus PIT and bridge. Where business rules live.
Information Mart — Consumption layer. Star, snowflake, OBT, graph. Disposable; rebuilt from vault.
Constructs
Hub — A unique business entity. One row per natural business key. The integration anchor for a concept (e.g. hub_customer).
Link — A relationship between two or more hubs. Modeled as its own table so that the relationship has its own history.
Satellite (Sat) — Descriptive context for a hub or link, with history. Multiple sats per parent is normal — split by source and by sensitivity.
Reference (Ref) — Code lookups (country code, currency, status). Either a flat ref_* table or a thin hub + sat for ones that need history.
Multi-Active Satellite (MAS) — A satellite with multiple active rows per parent at the same load date. Use only for genuine 1:N values, not for hierarchies.
Effectivity Satellite (Eff-Sat) — A satellite that tracks when a link is active, with start_date, end_date, is_active. Business vault construct.
Same-As Link (SAS Link) — A link that unifies entities across sources after identity resolution. Business vault construct.
PIT (Point-in-Time) — A materialized table with one row per hub key and snapshot date, holding pointers to the latest row of each satellite as of that snapshot. Business vault construct. Cuts query cost when reconstructing entity state at time T.
Bridge — A denormalized projection of a multi-hop link path, for query performance. Business vault construct.
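The PIT definition above can be illustrated with a minimal in-memory sketch. The function name `build_pit`, the input shapes, and the `<sat>_ldts` pointer column naming are assumptions made for illustration; a satellite with no row yet at a snapshot is represented here as `None`, which is a simplification.

```python
from bisect import bisect_right


def build_pit(hub_keys, satellites, snapshot_dates):
    """Return one PIT row per (hub key, snapshot date).

    satellites: {sat_name: {hub_hk: [ldts, ...]}}, the load dates per parent.
    Each PIT row points to the latest satellite load date at or before
    the snapshot, or None when the satellite has no row yet.
    """
    pit = []
    for hk in hub_keys:
        for snapshot in snapshot_dates:
            row = {"hk": hk, "snapshot": snapshot}
            for sat_name, by_parent in satellites.items():
                ldts_list = sorted(by_parent.get(hk, []))
                # Index of the first load date strictly after the snapshot.
                idx = bisect_right(ldts_list, snapshot)
                row[f"{sat_name}_ldts"] = ldts_list[idx - 1] if idx else None
            pit.append(row)
    return pit
```

A query for entity state at time T can then join each satellite on the hash key and the pointer column of the matching snapshot row, instead of searching every satellite for its latest row at T.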
Keys and audit
Business Key (BK) — The natural identifier of a business concept (customer ID, order number, ISBN). Sometimes composite.
Hash Key (HK) — A SHA-256 hash of the canonicalized business key. Replaces sequence surrogate keys. Columns named <entity>_hk. Enables parallel load and cross-platform consistency.
Hash Diff (HD) — A SHA-256 hash of a satellite's descriptive attributes, concatenated in alphabetical order by column name, used for delta detection. Columns named <sat>_hd.
Load Date (LDTS) — UTC timestamp of when the row entered the warehouse. Microsecond precision. Mandatory on every vault row.
Record Source (RSRC) — Opaque string identifying the source (e.g. crm.salesforce.contact, business-rules.customer-metrics-v1). Mandatory on every vault row.
Ghost Hub Row — A placeholder hub row inserted with record_source='deferred' when a link references a business key whose parent event hasn't arrived yet. Replaced (by deduplication on hash key) when the real event lands.
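The ghost hub row can be sketched in memory as follows. This is one possible reading of "replaced by deduplication on hash key", stated as an assumption: the ghost row is never deleted (the vault stays insert-only), the real row is appended next to it, and a read-time step keeps one row per hash key with the real row winning. All function names are hypothetical, and hash keys are passed in as opaque strings.

```python
from datetime import datetime, timezone

GHOST_SOURCE = "deferred"  # record_source value named in the glossary


def _append(hub_rows, hk, bk, rsrc):
    hub_rows.append(
        {"hk": hk, "bk": bk, "ldts": datetime.now(timezone.utc), "rsrc": rsrc}
    )


def ensure_parent(hub_rows, hk, bk):
    """Called by a link load: add a ghost row if the hub key is unknown."""
    if not any(r["hk"] == hk for r in hub_rows):
        _append(hub_rows, hk, bk, GHOST_SOURCE)


def load_hub(hub_rows, hk, bk, rsrc):
    """Called when the real event lands: insert unless a real row exists."""
    if not any(r["hk"] == hk and r["rsrc"] != GHOST_SOURCE for r in hub_rows):
        _append(hub_rows, hk, bk, rsrc)


def current_hub(hub_rows):
    """Read-time deduplication (assumed): one row per hash key, real row wins."""
    best = {}
    for r in hub_rows:
        seen = best.get(r["hk"])
        if seen is None or seen["rsrc"] == GHOST_SOURCE:
            best[r["hk"]] = r
    return list(best.values())
```

Both load functions are safe to rerun: a second call with the same hash key appends nothing.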
Hashing
SHA-256 — The mandated DV 2.1 hash algorithm. Hex-encoded, lowercase, 64 chars. MD5 is forbidden (collision risk at warehouse scale).
Canonical Ordering — The deterministic rule for concatenating attributes before hashing. For hash keys: business-key order as declared in the model. For hash diffs: alphabetical by column name.
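Null Sentinel — The literal string substituted for NULL during hashing. Default ^^. Critical because NULL in a business key is meaningful: it must hash differently from an empty string, and in many SQL engines concatenating a raw NULL turns the whole hash input into NULL.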
Unicode NFC Normalization — Canonical Composition form. Run on every string before hashing so é as one code point (U+00E9) and e + ́ as two code points (U+0065 U+0301) hash identically.
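The hashing rules above can be combined into a minimal Python sketch. It assumes a `||` delimiter between values and casts non-string values with `str()`; neither is specified in this glossary, and the function names are hypothetical. Trimming or case-folding of business keys, which some canonicalization schemes apply, is left out.

```python
import hashlib
import unicodedata

NULL_SENTINEL = "^^"  # default named in this glossary
DELIMITER = "||"      # assumption: the glossary does not specify a delimiter


def _canonical(value):
    """Turn one attribute value into its canonical string form."""
    if value is None:
        return NULL_SENTINEL
    # NFC-normalize so composed and decomposed forms hash identically.
    return unicodedata.normalize("NFC", str(value))


def _sha256_hex(parts):
    joined = DELIMITER.join(parts)
    # hexdigest() returns lowercase hex, 64 characters for SHA-256.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def hash_key(business_key_values):
    """Hash key: values in the business-key order declared in the model."""
    return _sha256_hex([_canonical(v) for v in business_key_values])


def hash_diff(attributes):
    """Hash diff: descriptive attributes ordered alphabetically by column name."""
    return _sha256_hex([_canonical(attributes[name]) for name in sorted(attributes)])
```

With this sketch, `hash_key(["café"])` returns the same digest whether the é arrives composed or decomposed, and `hash_diff` does not depend on the order in which columns are passed.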
Load mechanics
Insert-Only — The vault never updates or deletes. New facts arrive as new rows. The cornerstone of auditability.
Idempotent Load — Rerunning the same load produces zero new rows. Achieved through hash-key uniqueness and hash-diff comparison.
Parallel Load — All hubs can load in parallel; all links and sats can load in parallel after hubs. Hubs first is the only ordering constraint.
Late-Arriving Key — A link references a hub whose row hasn't been loaded yet. Solved by ghost hubs (DV 2.1 first-class pattern).
Micro-Batch — A streaming load shape where events are buffered for N seconds (or N events) and then batch-loaded. DV 2.1's preferred streaming pattern.
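Insert-only and idempotent loading can be sketched together for a satellite. The example below is a simplified in-memory illustration, not a production loader: the row shape (`hk`, `hd`, `ldts`, `rsrc`) and the function name are assumptions, and it expects at most one staged row per parent in a batch.

```python
from datetime import datetime, timezone


def load_satellite(sat_rows, staged_rows, record_source):
    """Append staged rows whose hash diff differs from the latest row per parent.

    sat_rows: list of dicts with keys hk, hd, ldts, rsrc (the existing satellite).
    staged_rows: list of dicts with keys hk, hd (already hashed in stage).
    Returns the number of rows appended. Never updates or deletes.
    """
    # Latest hash diff per parent hash key, ordered by load date.
    latest = {}
    for row in sorted(sat_rows, key=lambda r: r["ldts"]):
        latest[row["hk"]] = row["hd"]

    ldts = datetime.now(timezone.utc)  # UTC, microsecond precision
    inserted = 0
    for staged in staged_rows:
        if latest.get(staged["hk"]) == staged["hd"]:
            continue  # unchanged, so a rerun produces zero new rows
        sat_rows.append(
            {"hk": staged["hk"], "hd": staged["hd"], "ldts": ldts, "rsrc": record_source}
        )
        latest[staged["hk"]] = staged["hd"]
        inserted += 1
    return inserted
```

Running the same staged batch twice appends rows only on the first run; a changed hash diff for an existing parent appends one new row and leaves the earlier rows in place as history.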
What changed in 2.1 vs 2.0
| Area | DV 2.0 | DV 2.1 |
|---|---|---|
| Real-time load | "Possible but use carefully" | First-class with explicit micro-batch and continuous patterns; ghost-hub pattern formalized |
| NoSQL / graph | Out of scope | In scope as a mart projection target |
| Self-service BI | Implicit | Information mart layer formalized with star / snowflake / OBT / graph projections and virtualize-by-default rule |
| Business vault patterns | "Use as needed" | Formal catalog: same-as links, computed sats, PIT, bridge; materialization decision tree |
| PIT/Bridge classification | Often called "query assistance" | Explicitly business vault constructs |
| Hash algorithm | MD5 acceptable | SHA-256 mandated for new vaults |
| Tool model | ETL-tool vendor implied | dbt / ELT / macro-driven assumed |
Adjacent concepts (often confused)
Anchor Modeling — A different methodology, not DV. Similar goals (integration, history) but different shape (anchors, ties, and knots instead of hubs and links; one table per attribute).
Activity Schema — A single-table event-log warehouse pattern. Complementary to DV (can be a mart), not a replacement.
One Big Table (OBT) — A wide denormalized mart shape. A projection from DV, not an alternative to DV.
Lakehouse Medallion (Bronze/Silver/Gold) — A storage layering convention from Databricks. Roughly: bronze ≈ stage, silver ≈ raw vault, gold ≈ marts. The DV vocabulary is more precise about what's in each layer and why.
SCD Type 2 — A dimensional pattern for tracking history in a star schema. The raw vault's satellite is a generalized SCD2; marts built from the vault may flatten it back into SCD2 dims if BI tools expect them.
Kimball Bus Matrix — A planning artifact for star schema design. Useful as an input to DV modeling (it identifies the conformed dimensions ≈ hubs) but doesn't replace DV's integration role.