Validating Schema Changes Against Baseline Metadata

Validating schema changes against baseline metadata is a deterministic, CI-driven control that intercepts optimizer divergence before DDL reaches production. When DDL executes, the optimizer’s cost model recalculates cardinality estimates, access paths, and join orders — and without an automated gate, silent regressions surface as latency spikes, elevated CPU/IO wait, and plan-cache thrashing. This runbook establishes a repeatable, threshold-gated workflow that captures a schema-versioned baseline, replays the proposed migration on a staging replica, and blocks the merge when the captured plan drifts. It is the enforcement side of Schema Validation for Baseline Metadata and consumes the canonical plan envelopes produced by Normalizing Query Plans for Cross-Engine Comparison within the broader Automated EXPLAIN Capture & Storage Workflows pipeline.

Symptom Identification & Production Thresholds

Unvalidated DDL surfaces through observable telemetry that must be correlated with deployment timestamps to isolate the regression vector. CI gating enforces hard thresholds — soft warnings are insufficient for SLO-bound systems. The following breach conditions each trigger an automatic pipeline failure:

Plan hash divergence — the normalized plan_hash for a critical workload differs from baseline post-migration. This indicates the optimizer abandoned a previously stable strategy, usually from altered index visibility, dropped constraints, or partition-boundary changes. Any mismatch (!= baseline) blocks the merge.
Cardinality estimation drift — a per-node row estimate deviates by $> \pm 30\%$ from the stored baseline, forcing nested loops to replace hash joins or triggering spill-to-disk under memory pressure.
Cost-model inflation — the estimated Total Cost for the root node rises by > +15% versus baseline, correlating with missing statistics, altered column types, or implicit casting introduced by the change.
Buffer hit-ratio drop — shared-buffer hit ratio falls by < -10% versus baseline, signalling that a previously index-served access path now reads from disk.
Execution-time regression — EXPLAIN (ANALYZE) p95 wall time rises by > +20%, requiring a query rewrite or schema rollback before merge.

Metric	Threshold	Gate action
`plan_hash` mismatch	`!= baseline`	Block merge; trigger manual review
Cardinality drift	$> \pm 30\%$ per node	Block merge; require `UPDATE STATISTICS`/`ANALYZE` validation
Estimated cost delta	`> +15%`	Block merge; require optimizer-hint review
Buffer hit-ratio drop	`< -10%` vs baseline	Block merge; flag for index/IO review
Execution time (ANALYZE)	`> +20%` p95	Block merge; require query rewrite or rollback

These numeric bands should be defined once and shared with the regression thresholds engine so that schema gating and runtime alerting evaluate divergence identically; teams running variable traffic can source them from dynamic threshold tuning instead of static constants.

Root Cause Analysis

Root-cause isolation follows a strict diagnostic hierarchy. Each failure domain below carries the exact command that confirms or eliminates it.

Index invalidation. A Seq Scan replaces an Index Scan because the migration dropped, renamed, or invalidated an index the plan depended on. Confirm validity before assuming a cost problem:

SQL

-- Any invalid index silently forces sequential scans
SELECT c.relname AS index, i.indisvalid, i.indisready
FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT (i.indisvalid AND i.indisready);

Expected clean output: (0 rows). Any row here is the regression source.

Statistics staleness. DDL that rewrites a table (ALTER TABLE ... TYPE, ADD COLUMN ... DEFAULT) invalidates histograms, so selectivity math collapses. Check when each relation was last analyzed:

SQL

SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 0
ORDER BY n_mod_since_analyze DESC;

A high n_mod_since_analyze with an old last_analyze means the plan was costed against stale stats — run ANALYZE before re-evaluating.

Type-coercion overhead. A new implicit cast introduces a Filter node and defeats index usage. Inspect the plan for a cast on the predicate column with the \gexec-free one-liner:

BASH

psql "$STAGING_DSN" -c "EXPLAIN (VERBOSE) SELECT * FROM orders WHERE id = '42';" | grep -i 'Filter\|::'

If the output shows (id)::text = '42'::text, the column type changed under the query template.

Join-order flip. A nested loop replaces a hash join and spills work_mem. Compare pre/post plan shape directly; the deterministic plan_hash derived from the plan hashing algorithm is the authoritative signal that the tree changed, since PostgreSQL exposes no built-in plan-hash field in EXPLAIN (FORMAT JSON).

Step-by-Step Remediation

The pipeline runs in four isolated stages against a staging replica with production-identical hardware, work_mem, shared_buffers, parallelism settings, and a statistically representative data distribution. All code is asyncio-based with structlog and OpenTelemetry instrumentation so each run emits a correlatable trace.

1. Capture the schema-versioned baseline

Extract the current schema state and EXPLAIN (FORMAT JSON) plans for the top 50 critical queries, tagged with the migration version.

PYTHON

import asyncpg
import json
import structlog
from pathlib import Path
from opentelemetry import trace

log = structlog.get_logger("schema_validator")
tracer = trace.get_tracer("schema_validator")
BASELINE_DIR = Path("baselines")
BASELINE_DIR.mkdir(exist_ok=True)


async def extract_baseline(dsn: str, query_texts: dict[str, str]) -> dict:
    """query_texts maps query_id -> SQL. Stores EXPLAIN JSON per query
    alongside the current schema version tag."""
    with tracer.start_as_current_span("extract_baseline") as span:
        conn = await asyncpg.connect(dsn)
        try:
            # Track version via the migration table, not current_setting(),
            # which requires the GUC to be pre-loaded via SET/ALTER DATABASE.
            version = await conn.fetchval("SELECT MAX(version) FROM schema_migrations")
            span.set_attribute("db.schema_version", str(version))
            baseline = {"schema_version": str(version), "queries": {}}
            for qid, sql in query_texts.items():
                rows = await conn.fetch(f"EXPLAIN (FORMAT JSON) {sql}")
                baseline["queries"][qid] = rows[0][0]  # EXPLAIN JSON is a 1-row list
            log.info("baseline_captured", version=version, queries=len(query_texts))
        finally:
            await conn.close()
    (BASELINE_DIR / f"baseline_{version}.json").write_text(json.dumps(baseline, indent=2))
    return baseline

Expected log line: baseline_captured version=4821 queries=50.

2. Apply the proposed DDL in the sandbox

Apply the migration to the cloned schema, then refresh statistics. Disable autovacuum and counters during the window to eliminate noise.

SQL

-- Pre-migration isolation on the staging replica
ALTER SYSTEM SET autovacuum = off;
ALTER SYSTEM SET track_counts = off;
SELECT pg_reload_conf();

BEGIN;
\i /migrations/2026_05_12_add_column_nullable.sql
COMMIT;

-- Mandatory: cost-model accuracy depends on fresh histograms
ANALYZE VERBOSE;

3. Capture and diff the post-DDL plans

Replay EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for each baseline query. Capturing on the replica keeps this off the production hot path — the same isolation principle used when capturing EXPLAIN plans without impacting production performance. See the PostgreSQL EXPLAIN documentation for buffer and timing semantics.

PYTHON

from deepdiff import DeepDiff


async def capture_plan(dsn: str, sql: str) -> dict:
    conn = await asyncpg.connect(dsn)
    try:
        rows = await conn.fetch(f"EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) {sql}")
    finally:
        await conn.close()
    return rows[0][0][0]["Plan"]  # root Plan node


def evaluate_regression(baseline_plan: dict, new_plan: dict) -> bool:
    """Returns True if the new plan breaches a gate threshold."""
    with tracer.start_as_current_span("evaluate_regression") as span:
        cost_ratio = new_plan.get("Total Cost", 1) / max(
            baseline_plan.get("Total Cost", 1), 0.001)
        span.set_attribute("plan.cost_ratio", cost_ratio)
        if cost_ratio > 1.15:  # +15% cost inflation gate
            log.warning("cost_inflation", ratio=round(cost_ratio, 2))
            return True
        diff = DeepDiff(
            baseline_plan, new_plan, ignore_order=True,
            exclude_paths=[
                "root['Actual Total Time']", "root['Actual Rows']",
                "root['Actual Loops']", "root['Planning Time']",
            ],
        )
        if any("Node Type" in k for k in diff.get("values_changed", {})):
            log.warning("node_type_regression", diff=str(diff.get("values_changed")))
            return True
    return False

Expected: False for a clean migration; a cost_inflation ratio=1.34 warning blocks the run when the optimizer regresses.

4. Wire the CI gate

Fail the pipeline on any breach so the DDL cannot merge. Route the exact same cost-delta math used at runtime by cost-estimation mapping so staging and production agree.

YAML

validate_schema:
  stage: test
  services:
    - postgres:17
  script:
    - python -m schema_validator --dsn "$STAGING_DSN"
        --baseline-dir ./baselines --threshold-config ./thresholds.json
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  artifacts:
    reports:
      junit: validation-results.xml

When a gate fails, walk the fallback chain in strict order and never bypass a gate without documented approval: (1) roll the DDL back through the migration framework, preserving the schema-version lock; (2) if rollback is blocked by business requirements, pin the plan with an optimizer hint (pg_hint_plan /*+ Leading(...) */) to force baseline selection; (3) run a targeted ANALYZE with default_statistics_target scaled to 2× baseline and re-validate; (4) if regression persists > +15%, escalate to query-optimization engineers with the full EXPLAIN (ANALYZE, BUFFERS) artifacts. Accepted baselines are then persisted under the rules in Security Boundaries for Baseline Data Storage.

Verification Checklist

Run these after every gated migration before promoting the schema version:

[ ] pg_index reports zero rows where NOT (indisvalid AND indisready).
[ ] ANALYZE VERBOSE completed and pg_stat_user_tables.last_analyze is newer than the DDL apply timestamp.
[ ] Root-node Total Cost delta is <= +15% for all 50 baseline queries.
[ ] No per-node cardinality estimate drifted beyond $\pm 30\%$ .
[ ] Buffer hit ratio held within 10% of baseline on the replay.
[ ] The recomputed plan_hash matches baseline, or the mismatch is triaged and signed off.
[ ] CI validation-results.xml shows all threshold assertions green and the merge gate is unblocked.

Compatibility & Engine-Specific Notes

The pipeline’s shape is portable, but the plan primitives differ by engine. Normalize these differences before diffing so a cross-engine baseline never false-positives.

Concern	PostgreSQL	MySQL / MariaDB	Distributed SQL (CockroachDB / Yugabyte)
Plan format	`EXPLAIN (FORMAT JSON)`	`EXPLAIN FORMAT=JSON`	`EXPLAIN (FORMAT JSON)` (PG-compatible dialect)
Cost field	`Total Cost` (arbitrary units)	`query_cost` (float string)	per-node `estimatedRowCount` + latency estimate
Stats refresh	`ANALYZE`	`ANALYZE TABLE`	`CREATE STATISTICS` / auto-stats jobs
Built-in plan hash	none — compute SHA-256 of normalized tree	none — normalize `query_block` first	partial (`plan gist`) — still normalize
Index validity check	`pg_index.indisvalid`	`information_schema.STATISTICS`	`SHOW INDEXES` / `crdb_internal` tables

Because no engine exposes a stable, comparable plan hash natively, the deterministic plan-hash generation step is mandatory for cross-engine baselines: it strips volatile fields and canonicalizes the tree so plan_hash mismatches mean structural change, not formatting noise.

← Back to Schema Validation for Baseline Metadata
Automated EXPLAIN Capture & Storage Workflows — the parent capture-and-store pipeline
Defining Regression Thresholds for Query Plans — source the numeric gate bands
Plan Hashing Algorithms for SQL Engines — how the plan_hash divergence signal is computed
Capturing EXPLAIN Plans Without Impacting Production Performance — safe replay on replicas

Symptom Identification & Production Thresholds #

Root Cause Analysis #

Step-by-Step Remediation #

1. Capture the schema-versioned baseline #

2. Apply the proposed DDL in the sandbox #

3. Capture and diff the post-DDL plans #

4. Wire the CI gate #

Verification Checklist #

Compatibility & Engine-Specific Notes #

Related #