Analytics Engineering

A comprehensive reference for building scalable data infrastructure. I've documented proven patterns for dimensional modeling, data quality, and modern cloud architectures based on real-world implementations.

Dimensional Modeling Deep Dive

Fact Tables

The foundation of any dimensional model - storing quantifiable business metrics and events.

Transaction Facts

One row per business event (orders, clicks, payments)

fact_orders: order_id, customer_key, product_key, date_key, quantity, amount
Periodic Snapshot Facts

Regular intervals capturing status (daily inventory, account balances)

fact_daily_inventory: date_key, product_key, warehouse_key, quantity_on_hand, quantity_reserved
Accumulating Snapshot Facts

Tracks processes with defined milestones (order fulfillment, loan applications)

fact_order_fulfillment: order_key, ordered_date_key, shipped_date_key, delivered_date_key, cycle_time_hours
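
Accumulating snapshots make milestone analysis a simple aggregation. Below is a minimal sketch, assuming the fact_order_fulfillment columns above, a dim_date table keyed by date_key with a calendar_month column, and that in-flight orders carry NULL milestone keys:

-- Average fulfillment cycle time by order month
-- (orders not yet delivered are excluded; calendar_month is an assumed dim_date attribute)
SELECT
    d.calendar_month,
    AVG(f.cycle_time_hours) AS avg_cycle_time_hours,
    COUNT(*)                AS delivered_orders
FROM fact_order_fulfillment f
JOIN dim_date d ON f.ordered_date_key = d.date_key
WHERE f.delivered_date_key IS NOT NULL
GROUP BY d.calendar_month
ORDER BY d.calendar_month;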

Dimension Tables

Descriptive attributes that provide context to facts - the "who, what, where, when, why, and how."

Conformed Dimensions

Shared across multiple fact tables for consistency

dim_date: used by fact_orders, fact_inventory, fact_marketing_campaigns
Role-Playing Dimensions

Same dimension used multiple times with different meanings

dim_date AS order_date, dim_date AS ship_date, dim_date AS delivery_date
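
In SQL, role-playing is just the same physical table joined under different aliases. A minimal sketch, assuming a variant of fact_orders that carries separate order_date_key, ship_date_key, and delivery_date_key columns, and a calendar_date attribute on dim_date:

-- One dim_date, three roles: each join uses its own foreign key and alias
SELECT
    o.order_id,
    order_date.calendar_date    AS ordered_on,
    ship_date.calendar_date     AS shipped_on,
    delivery_date.calendar_date AS delivered_on
FROM fact_orders o
JOIN dim_date AS order_date    ON o.order_date_key    = order_date.date_key
JOIN dim_date AS ship_date     ON o.ship_date_key     = ship_date.date_key
JOIN dim_date AS delivery_date ON o.delivery_date_key = delivery_date.date_key;
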
Junk Dimensions

Consolidates low-cardinality flags and indicators

dim_order_flags: is_rush, is_gift, is_promotional, has_warranty
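
A common way to populate a junk dimension is to cross join the possible flag values so every combination gets exactly one surrogate key. A minimal sketch in Postgres-style SQL, assuming dim_order_flags already exists with an order_flags_sk surrogate key column:

-- Every combination of the four boolean flags becomes one row (16 rows total)
INSERT INTO dim_order_flags (order_flags_sk, is_rush, is_gift, is_promotional, has_warranty)
SELECT
    ROW_NUMBER() OVER (ORDER BY r.v, g.v, p.v, w.v) AS order_flags_sk,
    r.v, g.v, p.v, w.v
FROM       (SELECT TRUE AS v UNION ALL SELECT FALSE) r
CROSS JOIN (SELECT TRUE AS v UNION ALL SELECT FALSE) g
CROSS JOIN (SELECT TRUE AS v UNION ALL SELECT FALSE) p
CROSS JOIN (SELECT TRUE AS v UNION ALL SELECT FALSE) w;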

Slowly Changing Dimensions (SCD)

SCD Type Comparison

Type   | Strategy        | Use Case                               | Example
Type 0 | Retain Original | Fixed attributes that never change     | customer.date_of_birth
Type 1 | Overwrite       | No history needed, current state only  | customer.email
Type 2 | Add New Row     | Full history tracking required         | customer.address (with valid_from/valid_to)
Type 3 | Add Column      | Track current and previous value only  | product.current_price, product.previous_price

SCD Type 2 Implementation

-- Creating SCD Type 2 dimension with surrogate keys
CREATE TABLE dim_customer (
    customer_sk INT PRIMARY KEY,        -- Surrogate key
    customer_id VARCHAR(50),            -- Natural key
    customer_name VARCHAR(100),
    email VARCHAR(100),
    state VARCHAR(2),
    customer_segment VARCHAR(50),
    
    -- SCD Type 2 metadata
    valid_from DATE,
    valid_to DATE,
    is_current BOOLEAN,
    version_number INT,
    
    -- Audit fields
    created_date TIMESTAMP,
    updated_date TIMESTAMP
);

-- Query to get current records
SELECT * FROM dim_customer WHERE is_current = TRUE;

-- Query to get historical state as of a specific date
-- (assumes current rows carry a far-future valid_to such as 9999-12-31 rather than NULL)
SELECT * FROM dim_customer
WHERE '2023-06-15' BETWEEN valid_from AND valid_to;
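
Applying a Type 2 change is typically two statements: expire the current row, then insert the new version. A minimal sketch in Postgres-style SQL, assuming a staging table stg_customer with the latest source values and a sequence dim_customer_sk_seq for the surrogate key (both assumptions, not defined above):

-- Step 1: close out the current version for customers whose tracked attributes changed
UPDATE dim_customer d
SET valid_to     = CURRENT_DATE - 1,
    is_current   = FALSE,
    updated_date = CURRENT_TIMESTAMP
FROM stg_customer s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND (d.email IS DISTINCT FROM s.email
       OR d.state IS DISTINCT FROM s.state
       OR d.customer_segment IS DISTINCT FROM s.customer_segment);

-- Step 2: insert a new current version for changed and brand-new customers
INSERT INTO dim_customer (
    customer_sk, customer_id, customer_name, email, state, customer_segment,
    valid_from, valid_to, is_current, version_number, created_date, updated_date
)
SELECT
    NEXTVAL('dim_customer_sk_seq'),              -- surrogate key from an assumed sequence
    s.customer_id, s.customer_name, s.email, s.state, s.customer_segment,
    CURRENT_DATE,
    DATE '9999-12-31',                           -- open-ended until the next change
    TRUE,
    COALESCE((SELECT MAX(version_number)
              FROM dim_customer prev
              WHERE prev.customer_id = s.customer_id), 0) + 1,
    CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
FROM stg_customer s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customer d
    WHERE d.customer_id = s.customer_id AND d.is_current
);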

Schema Design Patterns

⭐ Star Schema

Structure: Single fact table surrounded by denormalized dimensions

Best for: Query performance, BI tool compatibility

Trade-off: Storage redundancy

fact_sales
    ├── dim_customer (denormalized)
    ├── dim_product (denormalized)
    ├── dim_store (denormalized)
    └── dim_date
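
The payoff of the star shape is that most questions resolve to one fact table plus a handful of single-hop joins. A minimal sketch, assuming conventional key and attribute names (sales_amount, region, calendar_month) on the tables in the diagram:

-- Revenue by region and month: one join per dimension, no dimension-to-dimension hops
SELECT
    d.calendar_month,
    s.region,
    SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN dim_date  d ON f.date_key  = d.date_key
JOIN dim_store s ON f.store_key = s.store_key
GROUP BY d.calendar_month, s.region
ORDER BY d.calendar_month, s.region;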

❄️ Snowflake Schema

Structure: Normalized dimension tables

Best for: Storage efficiency, data integrity

Trade-off: Query complexity, more joins

fact_sales
    ├── dim_customer
    │   └── dim_geography
    ├── dim_product
    │   ├── dim_category
    │   └── dim_brand
    └── dim_date
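
The same kind of question in a snowflake schema needs an extra hop wherever an attribute lives in a normalized sub-dimension. A minimal sketch, assuming category_name sits on dim_category rather than on dim_product:

-- Revenue by category: dim_product no longer carries the category name,
-- so the query walks fact_sales -> dim_product -> dim_category
SELECT
    c.category_name,
    SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_category c ON p.category_key = c.category_key
GROUP BY c.category_name;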

🌌 Galaxy Schema

Structure: Multiple fact tables sharing conformed dimensions

Best for: Enterprise-wide analytics

Trade-off: Complexity in maintaining conformed dimensions

fact_sales ←→ dim_customer ←→ fact_support
    ↓              ↓              ↓
dim_product    dim_date     dim_agent
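
Conformed dimensions are what make drill-across reporting possible: each fact table is aggregated to the shared grain separately, then the results are joined on the conformed keys. A minimal sketch, assuming fact_support shares the same date_key values as fact_sales and that fact_sales carries a sales_amount measure:

-- Sales revenue and support volume side by side, joined on the conformed date dimension
WITH sales AS (
    SELECT date_key, SUM(sales_amount) AS revenue
    FROM fact_sales
    GROUP BY date_key
),
support AS (
    SELECT date_key, COUNT(*) AS tickets
    FROM fact_support
    GROUP BY date_key
)
SELECT
    d.calendar_date,
    s.revenue,
    t.tickets
FROM dim_date d
LEFT JOIN sales   s ON s.date_key = d.date_key
LEFT JOIN support t ON t.date_key = d.date_key;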

🔗 Data Vault

Structure: Hubs (keys), Links (relationships), Satellites (attributes)

Best for: Auditability, agility, multiple sources

Trade-off: Implementation complexity

hub_customer
    ├── link_customer_order
    │   └── hub_order
    └── sat_customer_details
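
In table form, the hub holds only the business key, the link holds only relationships between hubs, and satellites hold the change-tracked attributes. A minimal sketch of the structures in the diagram above, with assumed column names following common Data Vault conventions:

-- Hub: one row per business key, ever
CREATE TABLE hub_customer (
    customer_hk   CHAR(32) PRIMARY KEY,      -- hash of the business key
    customer_id   VARCHAR(50) NOT NULL,      -- business key from the source system
    load_date     TIMESTAMP NOT NULL,
    record_source VARCHAR(50) NOT NULL
);

-- Link: one row per relationship between hubs
CREATE TABLE link_customer_order (
    customer_order_hk CHAR(32) PRIMARY KEY,
    customer_hk       CHAR(32) NOT NULL REFERENCES hub_customer (customer_hk),
    order_hk          CHAR(32) NOT NULL,     -- references hub_order (not shown here)
    load_date         TIMESTAMP NOT NULL,
    record_source     VARCHAR(50) NOT NULL
);

-- Satellite: descriptive attributes, versioned by load_date
CREATE TABLE sat_customer_details (
    customer_hk   CHAR(32) NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TIMESTAMP NOT NULL,
    customer_name VARCHAR(100),
    email         VARCHAR(100),
    hash_diff     CHAR(32),                  -- change detection across attribute versions
    record_source VARCHAR(50) NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);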

Engineering Excellence Checklist

Design Phase

  • Define clear grain for fact tables
  • Identify slowly changing dimensions
  • Plan for data quality from the start
  • Consider query patterns and performance

Implementation Phase

  • Implement idempotent pipelines (see the sketch after this checklist)
  • Add comprehensive testing
  • Document business logic
  • Monitor performance and costs
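
For the idempotency item above, the usual pattern is to have each run replace exactly the slice of data it owns, so rerunning a failed or duplicated job produces the same result as a single clean run. A minimal sketch, assuming a staging table stg_orders and a date_key stored as a YYYYMMDD integer (both assumptions):

-- Rerunning this block for the same date is safe: it always rebuilds that day in full
BEGIN;

DELETE FROM fact_orders
WHERE date_key = 20230615;

INSERT INTO fact_orders (order_id, customer_key, product_key, date_key, quantity, amount)
SELECT order_id, customer_key, product_key, date_key, quantity, amount
FROM stg_orders
WHERE date_key = 20230615;

COMMIT;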

Ready to Build Better Data Infrastructure?

Whether you're modernizing legacy systems or building from scratch, I can help design and implement data infrastructure that scales.

See How This Powers Analytics →