Tracking Cost Deltas Across Baseline Versions

This stage owns one job inside the regression pipeline: comparing the optimizer-estimated cost of a candidate execution plan against the cost anchored in a specific baseline version, and emitting a deterministic, normalized cost delta — never touching runtime latency, buffer telemetry, or business impact. A cost that jumps from one baseline version to the next is often the first quantitative symptom of statistics drift, a cost-model change after a major-version upgrade, or a regressed access path, and measuring it here — between plan capture and policy routing — gives Database SREs a numeric signal that fires before p95 latency ever moves.

Architectural Boundaries

The cost-delta comparator is a pure, stateless transformation node inside the broader Regression Detection & Rule Engines subsystem. It has exactly one upstream contract and one downstream contract, and it deliberately refuses to do anything outside them.

Upstream (consumes): normalized, immutable plan artifacts produced by the Automated EXPLAIN Capture & Storage Workflows pipeline. Each artifact is the output of Normalizing Query Plans for Cross-Engine Comparison — a canonical operator tree keyed by the fingerprint from Plan Hashing Algorithms for SQL Engines. The comparator requires a versioned baseline plan reference, a candidate plan reference, the optimizer cost metrics exposed on each (total_cost, startup_cost, rows, width in PostgreSQL; Cost and Rows in SQL Server; the mapped equivalents from Cost Estimation Mapping Across PostgreSQL and MySQL), and a schema_snapshot_hash. It never re-parses raw EXPLAIN text — that responsibility belongs strictly upstream.

Downstream (emits): a single structured CostDeltaPayload per comparison. That payload carries the absolute and percentage delta, a deterministic routing flag (STABLE, DRIFT, or REGRESSION_THRESHOLD_EXCEEDED), a SHA-256 hash of the comparison context, and a compact metadata envelope. It is published to the rule engine, which correlates the cost signal with the structural signal from Detecting Join Type Shifts in Execution Plans and the access-path signal from Monitoring Index Usage Changes for Regression Signals before any WARN/BLOCK verdict is reached. The comparator itself never blocks a deploy, opens a ticket, rewrites a query, or recommends an index.

This isolation is load-bearing. Because the stage evaluates only cost metrics against a pinned baseline version, it is idempotent, safe to run in parallel across thousands of query fingerprints, and free of runtime side effects. Quantifying drift and attributing root cause stay decoupled: this node answers “by how much did the estimated cost move between these two baseline versions?” and nothing else.

Deterministic Routing and Schema Enforcement

The comparison must be reproducible bit-for-bit. Cost models are sensitive to environmental fluctuation — buffer-pool state, concurrent workload pressure, transient planner heuristics — so before any arithmetic the stage strips non-deterministic metadata (timestamps, session IDs, cache-state counters) and compares only the anchored cost fields. Two captures of the same logical plan against the same baseline version must produce byte-identical payloads.

Routing is a pure function of the normalized percentage delta. The comparator maps pct_delta onto three deterministic states, and only these three:

STABLE — $pct_{delta} \le 0.05$ (≤ 5%). Optimizer variance within expected noise. Archived for trend analysis; no alert.
DRIFT — $0.05 < pct_{delta} \le 0.15$ (5–15%). Meaningful re-evaluation. Routed to the rule engine for correlation before a review is scheduled.
REGRESSION_THRESHOLD_EXCEEDED — pct_delta > 0.15 (> 15%). High-probability degradation. Forwarded to the gating controller for BLOCK/quarantine/rollback consideration.

Every emitted payload conforms to a versioned JSON Schema so downstream consumers parse results without conditional branching:

JSON

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://queryplan.org/schemas/cost-delta-payload/v1.json",
  "type": "object",
  "required": [
    "query_fingerprint", "baseline_version", "candidate_hash",
    "absolute_delta", "percentage_delta", "routing_flag", "context_hash"
  ],
  "properties": {
    "query_fingerprint": { "type": "string", "pattern": "^[0-9a-f]{64}$" },
    "baseline_version":  { "type": "integer", "minimum": 1 },
    "candidate_hash":    { "type": "string", "pattern": "^[0-9a-f]{64}$" },
    "absolute_delta":    { "type": "number" },
    "percentage_delta":  { "type": "number" },
    "routing_flag": {
      "type": "string",
      "enum": ["STABLE", "DRIFT", "REGRESSION_THRESHOLD_EXCEEDED",
               "BASELINE_MISSING", "SCHEMA_MISMATCH"]
    },
    "context_hash":      { "type": "string", "pattern": "^[0-9a-f]{64}$" },
    "metadata":          { "type": "object" }
  },
  "additionalProperties": false
}

The partition key for the output topic is derived deterministically so that every comparison for a given query lands on the same partition and preserves version ordering:

partition = crc32( query_fingerprint || ":" || (baseline_version mod PARTITION_STRIDE) ) mod TOPIC_PARTITIONS

Keying on the fingerprint (not the candidate hash) keeps the full version history of one query in a single ordered stream, which is what lets consumers reconstruct a per-version cost trajectory without a global sort. The context_hash — sha256(baseline_total_cost : candidate_total_cost : baseline_schema_hash : candidate_schema_hash) — is the idempotency token: replaying the same two artifacts yields the same key, so duplicate deliveries collapse to a single logical result.

Production-Ready Implementation

The reference comparator is an async consumer: it pulls candidate artifacts off the broker, loads the anchored baseline version from the plan store over an asyncpg pool, computes the normalized delta, and emits a schema-valid payload with structlog and OpenTelemetry instrumentation. It is stateless and side-effect-free on the happy path — the only I/O is a read of the baseline row and a publish of the result.

PYTHON

import asyncio
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any

import asyncpg
import structlog
from opentelemetry import metrics, trace

log = structlog.get_logger("cost_delta.comparator")
tracer = trace.get_tracer("cost_delta.comparator")
_meter = metrics.get_meter("cost_delta.comparator")

COST_DELTA_PCT = _meter.create_histogram(
    "cost_delta_percent", unit="ratio",
    description="Normalized optimizer cost delta between baseline and candidate",
)
ROUTING_FLAG_COUNT = _meter.create_counter(
    "routing_flag_count", description="Comparisons emitted per routing flag",
)
COMPARISON_MS = _meter.create_histogram(
    "comparison_duration_ms", unit="ms", description="Wall time of one comparison",
)

# Exact, non-negotiable routing bands. Tune per workload class via env, never
# leave as "configure as needed".
STABLE_THRESHOLD = 0.05
DRIFT_THRESHOLD = 0.15
# Version-aware scaling for optimizer cost-unit changes across major releases.
# Calibrate empirically per engine version; identity (1.0) is the safe default.
COST_SCALING = {"14": 1.0, "15": 1.02, "16": 1.05}


@dataclass(frozen=True)
class PlanCostMetrics:
    total_cost: float
    startup_cost: float
    estimated_rows: float
    schema_hash: str
    db_version: str


@dataclass(frozen=True)
class CostDeltaPayload:
    query_fingerprint: str
    baseline_version: int
    candidate_hash: str
    absolute_delta: float
    percentage_delta: float
    routing_flag: str
    context_hash: str
    metadata: dict[str, Any] = field(default_factory=dict)


def _normalize_cost(raw_cost: float, db_version: str) -> float:
    return raw_cost * COST_SCALING.get(db_version, 1.0)


def _context_hash(baseline: PlanCostMetrics, candidate: PlanCostMetrics) -> str:
    payload = (
        f"{baseline.total_cost}:{candidate.total_cost}"
        f":{baseline.schema_hash}:{candidate.schema_hash}"
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def _route(pct_delta: float) -> str:
    if pct_delta <= STABLE_THRESHOLD:
        return "STABLE"
    if pct_delta <= DRIFT_THRESHOLD:
        return "DRIFT"
    return "REGRESSION_THRESHOLD_EXCEEDED"


def compare(
    fingerprint: str,
    baseline_version: int,
    candidate_hash: str,
    baseline: PlanCostMetrics,
    candidate: PlanCostMetrics,
) -> CostDeltaPayload:
    """Pure, deterministic comparison. No I/O, no side effects."""
    if baseline.schema_hash != candidate.schema_hash:
        return CostDeltaPayload(
            fingerprint, baseline_version, candidate_hash,
            absolute_delta=0.0, percentage_delta=0.0,
            routing_flag="SCHEMA_MISMATCH",
            context_hash=_context_hash(baseline, candidate),
            metadata={"reason": "schema_snapshot_hash mismatch"},
        )

    b_cost = _normalize_cost(baseline.total_cost, baseline.db_version)
    c_cost = _normalize_cost(candidate.total_cost, candidate.db_version)
    if b_cost == 0:
        raise ValueError("Baseline cost is zero; cannot compute a meaningful delta.")

    abs_delta = c_cost - b_cost
    pct_delta = abs_delta / b_cost

    return CostDeltaPayload(
        query_fingerprint=fingerprint,
        baseline_version=baseline_version,
        candidate_hash=candidate_hash,
        absolute_delta=round(abs_delta, 4),
        percentage_delta=round(pct_delta, 4),
        routing_flag=_route(pct_delta),
        context_hash=_context_hash(baseline, candidate),
        metadata={
            "baseline_version": baseline.db_version,
            "candidate_version": candidate.db_version,
        },
    )


async def _load_baseline(
    pool: asyncpg.Pool, fingerprint: str, version: int
) -> PlanCostMetrics | None:
    row = await pool.fetchrow(
        """
        SELECT total_cost, startup_cost, estimated_rows, schema_hash, db_version
          FROM plan_baselines
         WHERE query_fingerprint = $1 AND baseline_version = $2
        """,
        fingerprint, version,
    )
    if row is None:
        return None
    return PlanCostMetrics(**dict(row))


async def handle_artifact(pool: asyncpg.Pool, artifact: dict[str, Any], emit) -> None:
    fingerprint = artifact["query_fingerprint"]
    version = artifact["baseline_version"]
    loop = asyncio.get_running_loop()
    started = loop.time()

    with tracer.start_as_current_span("cost_delta.compare") as span:
        span.set_attribute("query_fingerprint", fingerprint)
        span.set_attribute("baseline_version", version)

        baseline = await _load_baseline(pool, fingerprint, version)
        if baseline is None:
            payload = CostDeltaPayload(
                fingerprint, version, artifact["candidate_hash"],
                0.0, 0.0, "BASELINE_MISSING", "",
                metadata={"reason": "no baseline for version"},
            )
            log.warning("baseline_missing", fingerprint=fingerprint, version=version)
        else:
            candidate = PlanCostMetrics(**artifact["candidate_metrics"])
            payload = compare(
                fingerprint, version, artifact["candidate_hash"], baseline, candidate
            )
            COST_DELTA_PCT.record(payload.percentage_delta,
                                  {"routing_flag": payload.routing_flag})

        span.set_attribute("routing_flag", payload.routing_flag)
        ROUTING_FLAG_COUNT.add(1, {"routing_flag": payload.routing_flag})
        COMPARISON_MS.record((loop.time() - started) * 1000.0)

        level = (log.warning if payload.routing_flag == "REGRESSION_THRESHOLD_EXCEEDED"
                 else log.info)
        level("cost_delta_emitted", fingerprint=fingerprint,
              routing_flag=payload.routing_flag,
              percentage_delta=payload.percentage_delta)

        await emit(json.dumps(payload.__dict__, sort_keys=True).encode())

Logs must never carry raw query text or bind literals — only the fingerprint, versions, and routing flag — so the stage is safe to run against production traffic without leaking PII into the telemetry backend.

Threshold Table

Thresholds are tuned per workload class: OLTP paths demand tighter bounds because a modest estimated-cost move often precedes a real latency regression, while analytical pipelines tolerate more variance from dynamic partition pruning and parallelism. These are the exact numerics the reference implementation ships with — none are “configure as needed”. Set the baseline policy in Defining Regression Thresholds for Query Plans before enabling any CI gate that consumes these flags.

Metric	Pass (`STABLE`)	Warn (`DRIFT`)	Block (`REGRESSION_THRESHOLD_EXCEEDED`)	Automation trigger
`percentage_delta` (OLTP)	≤ 0.05	0.05 – 0.10	> 0.10	Route Block to CI gate
`percentage_delta` (analytical)	≤ 0.08	0.08 – 0.20	> 0.20	Route Block to CI gate
`comparison_duration_ms` p95	≤ 25 ms	25 – 75 ms	> 75 ms	Page platform on-call
Consumer lag (broker)	≤ 500 msgs	500 – 5 000	> 5 000	Scale replicas
`BASELINE_MISSING` rate	≤ 0.5%	0.5 – 2%	> 2%	Trigger baseline backfill

Alerting binds exclusively to the REGRESSION_THRESHOLD_EXCEEDED counter and the p95/lag SLOs; DRIFT feeds dashboards for capacity planning, not pagers:

YAML

groups:
  - name: cost-delta-comparator
    rules:
      - alert: CostRegressionThresholdExceeded
        expr: increase(routing_flag_count{routing_flag="REGRESSION_THRESHOLD_EXCEEDED"}[10m]) > 0
        for: 0m
        labels: { severity: page }
        annotations:
          summary: "Cost delta exceeded regression threshold for "
      - alert: CostDeltaComparatorSlow
        expr: histogram_quantile(0.95, rate(comparison_duration_ms_bucket[5m])) > 75
        for: 10m
        labels: { severity: ticket }
        annotations:
          summary: "Comparator p95 latency above 75ms SLO"

Failure Scenarios and Root Cause Analysis

Deterministic comparison assumes parseable, structurally compatible, correctly versioned artifacts. When an assumption breaks, the stage degrades safely and stays observable rather than halting the pipeline.

Schema-snapshot mismatch. Symptom: payloads emit SCHEMA_MISMATCH with zero delta; the rule engine sees no cost signal for that fingerprint. Root cause: the candidate was captured against DDL that never anchored a baseline. Diagnose: SELECT baseline_version, schema_hash FROM plan_baselines WHERE query_fingerprint = '<fp>' ORDER BY baseline_version; and compare against the candidate’s schema_snapshot_hash. Mitigate: re-anchor a baseline for the new schema before re-enqueuing; the comparator will not fabricate a delta across incompatible schema state.
Missing or corrupt baseline. Symptom: rising BASELINE_MISSING rate; new query fingerprints never reach STABLE. Root cause: baseline backfill lag or a partial write to plan_baselines. Diagnose: SELECT count(*) FROM plan_baselines WHERE total_cost IS NULL; to find corrupt rows. Mitigate: the payload routes to a quarantine queue for manual onboarding rather than throwing; run the backfill job and let quarantine drain.
Cost-model version drift. Symptom: an entire fleet of fingerprints shifts to DRIFT/REGRESSION_THRESHOLD_EXCEEDED immediately after a major-version upgrade. Root cause: the engine changed internal cost units, so raw total_cost is not comparable across versions. Diagnose: SELECT db_version, count(*) FROM plan_baselines GROUP BY db_version; to confirm a version boundary. Mitigate: calibrate COST_SCALING for the new release; if the version is unrecognized, the stage logs a WARN, bypasses scaling, and flags the comparison for manual review — it does not silently trust an uncalibrated unit.
Aggregate delta masking a localized regression. Symptom: STABLE at the top level while a single high-cardinality table’s operator regressed. Root cause: raw aggregate total_cost dilutes a large shift on one relation with small offsets elsewhere. Diagnose: re-run under table-weighted normalization. Mitigate: apply the methodology in Calculating Weighted Cost Deltas for Multi-Table Queries, where weights derive from historical row estimates and index selectivity, not live execution metrics.
Structural divergence between compared plans. Symptom: node-level alignment fails because the candidate introduced new join algorithms or reordered relations. Root cause: a legitimate optimizer improvement, not a regression. Mitigate: the stage aborts node diffing, falls back to aggregate total_cost, sets metadata["structural_mismatch"] = true, and routes to DRIFT regardless of magnitude so a real improvement is never misread as a failure — the structural question is handed to Detecting Join Type Shifts in Execution Plans.

Configuration Reference

Every band and knob is an environment variable so a workload class can retune without a redeploy; the defaults below are what the reference comparator ships with.

Env var	Default	Purpose
`COSTDELTA_STABLE_THRESHOLD`	`0.05`	Upper bound of the `STABLE` band (fraction)
`COSTDELTA_DRIFT_THRESHOLD`	`0.15`	Upper bound of the `DRIFT` band; above it is `REGRESSION_THRESHOLD_EXCEEDED`
`COSTDELTA_SCALING_MAP`	`14:1.0,15:1.02,16:1.05`	Version-aware cost-unit scaling coefficients
`COSTDELTA_POOL_MIN`	`4`	Minimum `asyncpg` pool connections per worker
`COSTDELTA_POOL_MAX`	`16`	Maximum pool connections; cap below replica `max_connections`
`COSTDELTA_INPUT_TOPIC`	`candidate-plan-artifacts`	Broker topic of incoming candidate artifacts
`COSTDELTA_OUTPUT_TOPIC`	`cost-delta-events`	Output topic for emitted `CostDeltaPayload`s
`COSTDELTA_TOPIC_PARTITIONS`	`24`	Partition count used in the partition-key modulus
`COSTDELTA_PARTITION_STRIDE`	`100`	Version stride folded into the partition key
`COSTDELTA_QUARANTINE_TOPIC`	`baseline-missing-quarantine`	Destination for `BASELINE_MISSING` payloads
`COSTDELTA_OTEL_ENDPOINT`	—	OTLP exporter endpoint for spans and metrics

← Back to Regression Detection & Rule Engines (parent topic)
Deeper in this area: Calculating Weighted Cost Deltas for Multi-Table Queries
Correlate cost with structure: Detecting Join Type Shifts in Execution Plans
Correlate cost with access paths: Monitoring Index Usage Changes for Regression Signals
Suppress noisy deltas: Tuning Thresholds for False Positive Reduction
Cost-unit foundations: Cost Estimation Mapping Across PostgreSQL and MySQL

Architectural Boundaries #

Deterministic Routing and Schema Enforcement #

Production-Ready Implementation #

Threshold Table #

Failure Scenarios and Root Cause Analysis #

Configuration Reference #

Related #