Data Warehouse Pack

Data warehouse design with star schema, slowly changing dimensions, partitioning, and query optimization. Install with one command: npx quanta-skills install data-warehouse-pack

We've all inherited a warehouse that's really just a database with a SELECT habit. You open the SQL, and there's no star schema. Just a massive denormalized table with duplicate customer addresses and a fact table that doesn't sum correctly because the grain is wrong. Slowly changing dimensions are the usual culprit. You need to track when a customer moves or changes their tier, but the schema overwrites the old value. Or worse, you're trying to implement Type 2 history manually, writing MERGE statements that break on race conditions. You're spending hours debugging why dim_customer has two "current" records, or why the fact table is missing rows. This isn't data engineering; it's triage.

Install this skill

npx quanta-skills install data-warehouse-pack

Requires a Pro subscription. See pricing.

A slowly changing dimension (SCD) is a framework for updating and maintaining data stored in dimension tables as dimensions change [3]. Without a disciplined approach, you end up with data that's useless for trend analysis. Using Type 1 for simple corrections might work for a log, but it destroys history [4]. You need Type 2 for full tracking, and getting that right is harder than it looks. If you're also building the upstream ingestion, check out the data-lake-pack for medallion layer patterns, but even with clean upstream data, a flat-file mindset in the warehouse will sink you.
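To make the distinction concrete, here is a minimal sketch of what the two approaches look like in plain SQL, using hypothetical dim_customer and loyalty_tier names (not the pack's actual schema): Type 1 overwrites in place, Type 2 closes the old row and versions a new one.

```sql
-- Hypothetical dim_customer / loyalty_tier names, for illustration only.

-- Type 1: overwrite in place. Cheap, but the previous tier is gone for good.
UPDATE dim_customer
SET    loyalty_tier = 'gold'
WHERE  customer_id = 42;

-- Type 2: close out the current version, then insert a new one with its own surrogate key.
UPDATE dim_customer
SET    end_date   = CURRENT_DATE,
       is_current = FALSE
WHERE  customer_id = 42
  AND  is_current  = TRUE;

INSERT INTO dim_customer
       (customer_sk, customer_id, loyalty_tier, effective_date, end_date,          is_current)
VALUES (10042,       42,          'gold',       CURRENT_DATE,   DATE '9999-12-31', TRUE);
```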

The Warehouse Trap: Flat Tables, Missing History, and Shuffle Spills

Every hour you spend manually fixing SCD logic is an hour you're not shipping. But the real cost is compute and trust. When your partitioning strategy is guesswork, Spark shuffles terabytes across the network. A misconfigured DISTRIBUTE BY can turn a 5-minute query into a 4-hour spill. Performance guidelines emphasize that table management and partitioning directly impact query speed [2]. If you're bucketing by the wrong column, you're paying for data scans that return zero rows.
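As a rough illustration of the partition-alignment point (table and column names are hypothetical, and the syntax shown is Spark SQL): a query that filters on the partition column prunes partitions instead of scanning the whole table, and a write distributed by the partition column avoids shuffling rows for every partition.

```sql
-- Hypothetical fact_sales / staging_sales tables; a sketch of the idea, not the pack's template.

-- Good: the filter matches the partition column, so Spark prunes partitions instead of scanning everything.
SELECT SUM(amount)
FROM   fact_sales
WHERE  transaction_date BETWEEN DATE '2024-07-01' AND DATE '2024-09-30';

-- When writing, cluster rows by the partition column so each task writes a single partition
-- instead of shuffling rows for every partition (fewer spills, fewer tiny files).
INSERT OVERWRITE TABLE fact_sales PARTITION (transaction_date)
SELECT order_id, customer_sk, amount, transaction_date
FROM   staging_sales
DISTRIBUTE BY transaction_date;
```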

Then there's the trust cost. When the CFO asks why Q3 returns don't match Q3 sales because the customer's loyalty tier changed mid-quarter and the warehouse lost that history, you lose credibility. Implementing SCDs requires careful handling of effective dates and surrogate keys [6]. If you get this wrong, downstream BI tools show ghost records and missing updates. You're not just fixing SQL; you're rebuilding the foundation of your analytics stack. This pack pairs with the dbt-analytics-pack for end-to-end modeling workflows, ensuring your dimensional models are tested and deployed with confidence.
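The fix for the Q3 scenario is a point-in-time join: each fact row picks up the dimension version that was in effect on the transaction date, so a mid-quarter tier change doesn't rewrite history. A minimal sketch, assuming a Type 2 dim_customer with effective_date and end_date columns (hypothetical names):

```sql
-- Hypothetical names; a sketch of a point-in-time join against a Type 2 dimension.
SELECT f.order_id,
       f.amount,
       d.loyalty_tier AS tier_at_time_of_sale
FROM   fact_sales   f
JOIN   dim_customer d
  ON   d.customer_id       = f.customer_id
 AND   f.transaction_date >= d.effective_date
 AND   f.transaction_date <  d.end_date;
```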

What Bad Schema Design Costs You in Compute and Trust

A misaligned partition key doesn't just slow down queries; it inflates your cloud bill. Spark jobs that ignore data skew can waste 80% of their compute time on stragglers. When you're running incremental loads on a fact_sales table without proper partitioning hints, you're reprocessing data that hasn't changed. The validate_project.sh script in this pack catches these issues early, but if you're designing from scratch, you need the patterns baked in. Unlike the database-design-pack which focuses on OLTP normalization, this pack targets analytical star schemas where denormalization and history tracking are features, not bugs.
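For context, this is roughly what a partition-aware incremental load looks like in dbt. Model and column names are hypothetical, and the partition_by / incremental_strategy settings shown are the dbt-spark flavor, so adjust for your adapter:

```sql
-- A minimal dbt incremental sketch (hypothetical model and columns, not the pack's fact_sales.sql).
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['transaction_date']
) }}

SELECT order_id, customer_sk, amount, transaction_date
FROM   {{ ref('stg_sales') }}
{% if is_incremental() %}
  -- Only touch recent partitions instead of reprocessing the full history on every run.
  WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), 3)
{% endif %}
```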

When your SCD implementation lacks is_current flags or has overlapping date ranges, your aggregations double-count revenue. We've seen teams lose days of engineering time reconciling these discrepancies. The cost isn't just the engineering hours; it's the delayed dashboard, the missed campaign, the angry stakeholder. If you're designing dimensions, remember that star schema design pursues different goals than the OLTP normalization covered in the database-design-pack. You need surrogate keys to decouple business keys from history, and you need window functions to manage effective dates without manual merge hacks.
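A simple audit query of the kind the pack's validators encode can surface both failure modes, duplicate current rows and overlapping ranges, before they reach a dashboard. A sketch against a hypothetical dim_customer:

```sql
-- Hypothetical audit query; flags business keys with more than one "current" row
-- or with effective ranges that overlap the next version.
SELECT customer_id
FROM (
    SELECT customer_id,
           SUM(CASE WHEN is_current THEN 1 ELSE 0 END)
               OVER (PARTITION BY customer_id)                          AS current_rows,
           LEAD(effective_date)
               OVER (PARTITION BY customer_id ORDER BY effective_date) AS next_effective_date,
           end_date
    FROM dim_customer
) v
WHERE current_rows > 1
   OR next_effective_date < end_date
GROUP BY customer_id;
```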

A Retail Team's Migration from Monolith to Star Schema

Imagine a retail analytics team migrating from a legacy ERP dump to a cloud warehouse. They start with a flat file export. By week two, they have a fact_sales table that duplicates customer records on every load. The dim_customer table overwrites the previous address, so the finance team can't reconcile returns against the original purchase. The Spark jobs are slow because everything is partitioned by load_date while the query filters are on transaction_date. The team is drowning in SELECT queries and broken historical reports.

Star schema design is critical for analytical workloads, separating facts from dimensions to optimize query performance [1]. In this scenario, the team needed Type 2 to track address changes, but their flat file approach made that impossible without a complete refactor [5]. SCD types range from overwriting to full history, and choosing the right type depends on the business requirement. They needed to preserve the address at the time of sale while tracking the customer's current location. This required a surrogate key in the dimension, effective_date and end_date columns, and a flag to identify the current record. Without these patterns, the warehouse is just a slower version of the ERP.
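In concrete terms, the dimension the team needs looks roughly like this (a hypothetical sketch with Spark-style types, not the pack's template):

```sql
-- Hypothetical DDL; the shape the scenario calls for.
CREATE TABLE dim_customer (
    customer_sk     BIGINT,    -- surrogate key: decouples history from the business key
    customer_id     BIGINT,    -- business key from the ERP
    address         STRING,
    loyalty_tier    STRING,
    effective_date  DATE,      -- when this version became true
    end_date        DATE,      -- when it stopped being true (open rows use 9999-12-31)
    is_current      BOOLEAN    -- convenience flag for "latest version" lookups
);
```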

Star Schema, SCD Type 2, and Spark Optimization Out of the Box

Once the Data Warehouse Pack is installed, the agent generates a star schema that actually holds up. fact_sales.sql uses on_schema_change='fail' and contract enforcement, so bad data breaks the pipeline before it hits the warehouse. dim_customer_scd2.sql implements proper Type 2 logic with is_current flags and effective_date ranges, preserving full history without manual merge hacks. Spark queries leverage DISTRIBUTE BY and bucketing hints, reducing shuffle volume by orders of magnitude.
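The "no manual merge hacks" point boils down to deriving each version's end date from the next version's start date with a window function. A minimal sketch over a hypothetical staging table of customer changes (the pack's dim_customer_scd2.sql is a full dbt incremental model):

```sql
-- Hypothetical stg_customer_changes table; LEAD() turns a change log into Type 2 ranges.
SELECT customer_id,
       address,
       loyalty_tier,
       changed_at AS effective_date,
       COALESCE(
           LEAD(changed_at) OVER (PARTITION BY customer_id ORDER BY changed_at),
           DATE '9999-12-31'
       )          AS end_date,
       LEAD(changed_at) OVER (PARTITION BY customer_id ORDER BY changed_at) IS NULL
                  AS is_current
FROM   stg_customer_changes;
```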

The validate_project.sh script runs in CI, catching missing partition hints or malformed SCD columns before they reach production. You get a warehouse where facts are granular, dimensions are history-aware, and queries are optimized by design. For governance policies downstream, the data-lake-pack covers metadata cataloging, but the warehouse itself needs to be rock-solid. Testing SCD logic is critical; the dbt-analytics-pack provides CI/CD integration for deployments, and this pack ensures your models are correct before they get there. We built this so you don't have to reinvent the wheel for every project. Install the pack, and the agent handles the star schema, SCD Type 2, and Spark optimization patterns automatically.

What's in the Data Warehouse Pack

  • skill.md — Orchestrator skill that defines the data warehouse design workflow, references all templates, references, validators, and examples, and instructs the agent on when to apply star schema, SCD, partitioning, and optimization patterns.
  • templates/fact_sales.sql — Production-grade dbt incremental model for a sales fact table. Uses on_schema_change='fail', contract enforcement, and Spark partitioning hints for query optimization.
  • templates/dim_customer_scd2.sql — SCD Type 2 dimension model using dbt incremental materialization, window functions for history tracking, and audit columns. Implements surrogate keys and effective date logic.
  • templates/spark_optimization.sql — Curated Spark SQL patterns for partitioning, bucketing, storage partition joins, and ORC/Hive optimization. Includes DISTRIBUTE BY, REPARTITION hints, and Iceberg bucketing configs.
  • references/dimensional-modeling.md — Canonical knowledge on star schema design, Kimball methodology, SCD types (1, 2, 3), surrogate keys, audit columns, and denormalization trade-offs for analytical workloads.
  • references/spark-performance.md — Canonical knowledge on distributed partitioning, shuffle optimization, bucketing vs partitioning, storage partition joins, ORC filter pushdown, and foreachPartition streaming patterns.
  • scripts/validate_project.sh — Executable validator that scans a target directory for required DWH artifacts (fact/dim patterns, dbt config, Spark hints). Exits non-zero if critical patterns are missing or malformed.
  • validators/dbt_schema.yml — Production dbt schema.yml with enforced contracts, primary_key/not_null constraints, and data_tests. Validates that star schema models adhere to dimensional modeling standards.
  • tests/test_scd_type2.py — Python test script that validates SCD Type 2 logic against sample data. Uses assertions and exits 1 on failure to ensure historical tracking correctness.
  • examples/retail_warehouse_spec.yaml — Worked example: complete retail data warehouse design specification including table definitions, partitioning strategy, SCD handling, and query optimization notes.

Install and Ship

Stop guessing partition keys and manually coding SCD logic. Upgrade to Pro to install the Data Warehouse Pack. Your warehouse should be a star schema, not a flat file with delusions of grandeur. Install now and ship.

References

  1. Understand star schema and the importance for Power BI — learn.microsoft.com
  2. Performance guidelines in Fabric Data Warehouse — learn.microsoft.com
  3. Mastering Slowly Changing Dimensions (SCD) — datacamp.com
  4. Best Practices for Slowly Changing Dimensions — medium.com
  5. Slowly Changing Dimensions Explained: SCD Types 1-4 & ... — weld.app
  6. Implement a slowly changing dimension in Amazon Redshift — aws.amazon.com
  7. Slowly Changing Dimensions: Types 1-3 with Examples — dev.to
  8. Building slowly changing dimension on a Fact/ ... — dba.stackexchange.com

Frequently Asked Questions

How do I install Data Warehouse Pack?

Run `npx quanta-skills install data-warehouse-pack` in your terminal. The skill will be installed to ~/.claude/skills/data-warehouse-pack/ and automatically available in Claude Code, Cursor, Copilot, and other AI coding agents.

Is Data Warehouse Pack free?

Data Warehouse Pack is a Pro skill — $29/mo Pro plan. You need a Pro subscription to access this skill. Browse 37,000+ free skills at quantaintelligence.ai/skills.

What AI coding agents work with Data Warehouse Pack?

Data Warehouse Pack works with Claude Code, Cursor, GitHub Copilot, Gemini CLI, Windsurf, Warp, and any AI coding agent that reads skill files. Once installed, the agent automatically gains the expertise defined in the skill.