Runbook

Validating Schema Changes Against Baseline Metadata

Validating Schema Changes Against Baseline Metadata is a deterministic, CI-driven control plane requirement for any production database operating under strict SLOs. When DDL executes, the optimizer’s cost model recalculates cardinality estimates, access paths, and join orders. Without automated validation, silent regressions manifest as sudden latency spikes, elevated CPU/IO wait states, and plan cache thrashing. This runbook establishes a repeatable, threshold-gated workflow to intercept optimizer divergence before DDL reaches production. It operates as a core component of the Schema Validation for Baseline Metadata cluster and integrates directly with broader Automated EXPLAIN Capture & Storage Workflows to ensure plan stability across schema evolution.

Symptom Identification & Root Cause Analysis

Production incidents triggered by unvalidated DDL typically surface through three observable telemetry patterns. SREs must correlate these signals with deployment timestamps to isolate regression vectors.

  1. Plan Hash Divergence: The plan_hash or query_id for a critical workload shifts post-migration. This indicates the optimizer abandoned a previously stable execution strategy, often due to altered index visibility, constraint drops, or partition boundary changes.
  2. Cardinality Estimation Drift: Row estimates deviate by >30% from historical baselines. This forces nested loops to replace hash joins, triggers unnecessary materializations, or causes spill-to-disk operations under memory pressure.
  3. Cost Model Inflation: The estimated execution cost multiplies beyond acceptable bounds. Correlates strongly with missing statistics, altered column types, or implicit type casting introduced by schema modifications.

Root cause isolation follows a strict diagnostic hierarchy:

  • Layer 1: Verify DDL impact on indexed columns, partition schemes, and foreign key constraints.
  • Layer 2: Confirm ANALYZE/UPDATE STATISTICS execution post-migration. Stale histograms invalidate selectivity calculations.
  • Layer 3: Compare pre- and post-apply execution trees. If nullability, data types, or index visibility changed, the optimizer’s cost functions will shift, invalidating previously pinned or baseline-captured plans.

Deterministic Regression Thresholds

CI gating must enforce hard thresholds. Soft warnings are insufficient for SLO-bound systems. The following metrics define automatic pipeline failure:

MetricThresholdAction
plan_hash mismatch!= baselineBlock merge; trigger manual review
Cardinality drift>±30%> \pm 30\% per nodeBlock merge; require UPDATE STATISTICS validation
Estimated cost delta> +15%Block merge; require optimizer hint review
Buffer hit ratio drop< -10% vs baselineBlock merge; flag for index/IO review
Execution time (ANALYZE)> +20% p95Block merge; require query rewrite or schema rollback

Step-by-Step Validation Pipeline

The automation pipeline executes in four isolated stages. All operations run against a staging replica with identical hardware profiles, configuration parameters, and statistically representative data distributions.

1. Baseline Metadata & Plan Extraction

Capture the current schema state and associated execution plans for the top 50 critical queries. Store metadata as versioned JSON artifacts containing table definitions, index structures, constraint states, and optimizer statistics. Use a deterministic query set tagged via pg_stat_statements or equivalent telemetry.

PYTHON
import asyncpg
import json
from pathlib import Path

BASELINE_DIR = Path("baselines")
BASELINE_DIR.mkdir(exist_ok=True)

async def extract_baseline(dsn: str, query_ids: list[str]) -> dict:
    conn = await asyncpg.connect(dsn)
    baseline = {"queries": {}, "schema_version": await conn.fetchval("SELECT current_setting('schema_version')")}
    
    for qid in query_ids:
        plan_json = await conn.fetchval(f"EXPLAIN (FORMAT JSON) SELECT * FROM pg_stat_statements WHERE queryid = $1", qid)
        baseline["queries"][qid] = plan_json[0]
        
    await conn.close()
    BASELINE_DIR.joinpath(f"baseline_{baseline['schema_version']}.json").write_text(json.dumps(baseline, indent=2))
    return baseline

2. Sandbox DDL Application

Apply the proposed migration script to a cloned schema. Ensure work_mem, shared_buffers, and parallelism settings match production. Disable auto-vacuum and background statistics collection during the test window to eliminate noise.

SQL
-- Pre-migration isolation
ALTER SYSTEM SET autovacuum = off;
ALTER SYSTEM SET track_counts = off;
SELECT pg_reload_conf();

-- Apply DDL in transaction
BEGIN;
\i /migrations/2024_05_12_add_column_nullable.sql
COMMIT;

-- Post-apply stats refresh (mandatory for cost model accuracy)
ANALYZE VERBOSE;

3. EXPLAIN Capture & Comparison

Execute EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for each baseline query. Capture actual execution metrics, not just estimates. Reference the official PostgreSQL EXPLAIN documentation for buffer and timing semantics.

PYTHON
async def capture_explain(dsn: str, query_sql: str) -> dict:
    conn = await asyncpg.connect(dsn)
    plan = await conn.fetchval(f"EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) {query_sql}")
    await conn.close()
    return plan[0]

4. Diff Evaluation & CI Gating

Compare captured plans against baseline artifacts using structural diffing. Focus on node type changes, cost inflation, and buffer utilization.

PYTHON
from deepdiff import DeepDiff
import sys

def evaluate_regression(baseline_plan: dict, new_plan: dict) -> bool:
    # Extract critical nodes
    b_root = baseline_plan.get("Plan", {})
    n_root = new_plan.get("Plan", {})
    
    # Threshold checks
    cost_ratio = n_root.get("Total Cost", 1) / b_root.get("Total Cost", 1)
    if cost_ratio > 1.15:
        print(f"[FAIL] Cost inflation: {cost_ratio:.2f}x")
        return True
        
    if b_root.get("Plan Hash") != n_root.get("Plan Hash"):
        print("[FAIL] Plan hash divergence detected")
        return True
        
    # Structural diff for node type changes
    diff = DeepDiff(b_root, n_root, ignore_order=True, exclude_paths=["root['Execution Time']", "root['Planning Time']"])
    if diff.get("values_changed") or diff.get("iterable_item_added"):
        print(f"[FAIL] Structural regression: {diff}")
        return True
        
    return False

Debugging Scenarios & Safe Fallback Chains

When validation fails, execute the following mitigation paths in strict order. Do not bypass CI gates without documented SRE approval.

ScenarioDiagnostic SignalMitigation Path
Index InvalidationSeq Scan replaces Index Scan post-DDLRebuild index concurrently; verify pg_index.indisvalid = true; re-run ANALYZE
Type Coercion OverheadFilter node with implicit cast appearsAdd explicit cast in query; adjust column type with USING clause; update query templates
Statistics StalenessCardinality drift >30% despite stable schemaRun ANALYZE with increased default_statistics_target; verify sample coverage; trigger manual stats refresh
Join Order FlipNested loop replaces hash join; work_mem spill detectedIncrease work_mem temporarily; add SET enable_nestloop = off for validation; evaluate partition pruning

Fallback Chain Protocol:

  1. Immediate Rollback: REVERT DDL via migration framework. Preserve schema version lock.
  2. Plan Pinning: If rollback is blocked by business requirements, inject optimizer hints (/*+ LEADING(...) */ or equivalent) to force baseline plan selection.
  3. Statistical Correction: Execute targeted ANALYZE with default_statistics_target scaled to 2x baseline. Re-validate.
  4. Escalation: If regression persists >15% after steps 1-3, escalate to query optimization engineers with full EXPLAIN (ANALYZE, BUFFERS) artifacts.

Observability Integration & Telemetry Routing

Validation pipelines must emit structured telemetry to enable post-deployment correlation. Route metrics through your existing observability stack using OpenTelemetry semantic conventions for database spans.

Required Metrics:

  • schema_validation.plan_hash_mismatch (Counter)
  • schema_validation.cardinality_drift_pct (Gauge)
  • schema_validation.cost_inflation_ratio (Gauge)
  • schema_validation.validation_duration_ms (Histogram)

CI/CD Integration:

YAML
# .gitlab-ci.yml / GitHub Actions equivalent
validate_schema:
  stage: test
  services:
    - postgres:15
  script:
    - python -m schema_validator --dsn $STAGING_DSN --baseline-dir ./baselines --threshold-config ./thresholds.json
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  artifacts:
    reports:
      junit: validation-results.xml

Alerting Rules:

  • schema_validation.plan_hash_mismatch > 0 → Page SRE on-call
  • schema_validation.cardinality_drift_pct > 30 → Slack warning + Jira ticket auto-creation
  • schema_validation.validation_duration_ms > 300000 → Timeout alert; investigate sandbox resource contention

Operational Runbook Summary

Validating Schema Changes Against Baseline Metadata transforms DDL deployment from a probabilistic risk into a deterministic control plane operation. By enforcing strict regression thresholds, capturing EXPLAIN (ANALYZE, BUFFERS) artifacts, and implementing automated fallback chains, platform teams eliminate silent optimizer regressions. Integrate this pipeline into your CI/CD gate, route telemetry to centralized dashboards, and maintain a versioned baseline repository. When schema evolution is predictable, query performance remains stable, and SLO violations become preventable rather than reactive.