Analytics Engineering
A comprehensive reference for building scalable data infrastructure. I've documented proven patterns for dimensional modeling, data quality, and modern cloud architectures based on real-world implementations.
Dimensional Modeling Deep Dive
Fact Tables
The foundation of any dimensional model - storing quantifiable business metrics and events.
Transaction Facts
One row per business event (orders, clicks, payments)
fact_orders: order_id, customer_key, product_key, date_key, quantity, amount
Periodic Snapshot Facts
Regular intervals capturing status (daily inventory, account balances)
fact_daily_inventory: date_key, product_key, warehouse_key, quantity_on_hand, quantity_reserved
Accumulating Snapshot Facts
Tracks processes with defined milestones (order fulfillment, loan applications)
fact_order_fulfillment: order_key, ordered_date_key, shipped_date_key, delivered_date_key, cycle_time_hours
Dimension Tables
Descriptive attributes that provide context to facts - the "who, what, where, when, why, and how."
Conformed Dimensions
Shared across multiple fact tables for consistency
dim_date: used by fact_orders, fact_inventory, fact_marketing_campaigns
Role-Playing Dimensions
Same dimension used multiple times with different meanings
dim_date AS order_date, dim_date AS ship_date, dim_date AS delivery_date
Junk Dimensions
Consolidates low-cardinality flags and indicators
dim_order_flags: is_rush, is_gift, is_promotional, has_warranty
Slowly Changing Dimensions (SCD)
SCD Type Comparison
Type | Strategy | Use Case | Example |
---|---|---|---|
Type 0 | Retain Original | Fixed attributes that never change | customer.date_of_birth |
Type 1 | Overwrite | No history needed, current state only | customer.email |
Type 2 | Add New Row | Full history tracking required | customer.address (with valid_from/valid_to) |
Type 3 | Add Column | Track current and previous value only | product.current_price, product.previous_price |
SCD Type 2 Implementation
-- Creating SCD Type 2 dimension with surrogate keys
CREATE TABLE dim_customer (
customer_sk INT PRIMARY KEY, -- Surrogate key
customer_id VARCHAR(50), -- Natural key
customer_name VARCHAR(100),
email VARCHAR(100),
state VARCHAR(2),
customer_segment VARCHAR(50),
-- SCD Type 2 metadata
valid_from DATE,
valid_to DATE,
is_current BOOLEAN,
version_number INT,
-- Audit fields
created_date TIMESTAMP,
updated_date TIMESTAMP
);
-- Query to get current records
SELECT * FROM dim_customer WHERE is_current = TRUE;
-- Query to get historical state as of specific date
SELECT * FROM dim_customer
WHERE '2023-06-15' BETWEEN valid_from AND valid_to;
Schema Design Patterns
⭐ Star Schema
Structure: Single fact table surrounded by denormalized dimensions
Best for: Query performance, BI tool compatibility
Trade-off: Storage redundancy
fact_sales
├── dim_customer (denormalized)
├── dim_product (denormalized)
├── dim_store (denormalized)
└── dim_date
❄️ Snowflake Schema
Structure: Normalized dimension tables
Best for: Storage efficiency, data integrity
Trade-off: Query complexity, more joins
fact_sales
├── dim_customer
│ └── dim_geography
├── dim_product
│ ├── dim_category
│ └── dim_brand
└── dim_date
🌌 Galaxy Schema
Structure: Multiple fact tables sharing conformed dimensions
Best for: Enterprise-wide analytics
Trade-off: Complexity in maintaining conformed dimensions
fact_sales ←→ dim_customer ←→ fact_support
↓ ↓ ↓
dim_product dim_date dim_agent
🔗 Data Vault
Structure: Hubs (keys), Links (relationships), Satellites (attributes)
Best for: Auditability, agility, multiple sources
Trade-off: Implementation complexity
hub_customer
├── link_customer_order
│ └── hub_order
└── sat_customer_details
Engineering Excellence Checklist
Design Phase
- ✓Define clear grain for fact tables
- ✓Identify slowly changing dimensions
- ✓Plan for data quality from the start
- ✓Consider query patterns and performance
Implementation Phase
- ✓Implement idempotent pipelines
- ✓Add comprehensive testing
- ✓Document business logic
- ✓Monitor performance and costs
Ready to Build Better Data Infrastructure?
Whether you're modernizing legacy systems or building from scratch, I can help design and implement data infrastructure that scales.
See How This Powers Analytics →