Runbook

Setting Dynamic Thresholds for Query Regression Alerts

Static latency or execution-time thresholds fail in production because query performance is inherently non-linear. Data volume shifts, parameter sniffing, index fragmentation, and optimizer statistics updates all introduce variance that fixed alerting rules cannot absorb. Setting Dynamic Thresholds for Query Regression Alerts requires a baseline-driven, statistically adaptive approach that correlates plan stability with execution metrics. This runbook operationalizes the principles established in Core Architecture & Baselining Fundamentals and provides deterministic workflows for detection, triage, and safe override.

Telemetry Capture & Multi-Signal Correlation

Production regressions rarely manifest as isolated metric spikes. They appear as correlated deviations across execution profiles. To eliminate false positives, telemetry must be captured at the query-plan fingerprint level, not the raw statement level. Implement the following signal collection pipeline using OpenTelemetry database instrumentation and eBPF-based wait-state tracers:

  • Plan Hash Instability: Monitor plan_hash_value (SQL Server) or queryid/planid (PostgreSQL). Critical paths tolerate 0 drift. Any hash change triggers immediate baseline re-evaluation.
  • Logical Read Explosion: Track buffer_gets (Oracle/SQL Server) or shared_blks_hit + shared_blks_read (PostgreSQL). Threshold: >35% deviation from baseline on identical parameter distributions.
  • Execution Time Variance: Monitor p95_duration against a rolling 7-day median. Threshold: sustained crossing of 1.35x for 3\ge 3 consecutive execution windows (typically 5-minute intervals).
  • Row Examination Ratio: Calculate rows_examined / rows_returned. Threshold: >10:1 for OLTP point-lookups, >50:1 for analytical range scans.
  • Wait Profile Shift: Detect sudden dominance of PAGEIOLATCH, CPU, or LOCK waits where ASYNC_NETWORK_IO or NETWORK previously dominated.

Correlation Rule: Alerts fire only when 2\ge 2 signals breach thresholds within a 15-minute sliding window. Single-metric deviations route to a low-priority telemetry bucket for trend analysis, not incident response.

Dynamic Threshold Calculation Engine

Fixed multipliers ignore seasonal traffic patterns and gradual data growth. Replace them with an Exponentially Weighted Moving Average (EWMA) baseline paired with Median Absolute Deviation (MAD) for variance normalization. The following Python engine computes adaptive upper/lower bounds:

PYTHON
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation

def compute_dynamic_thresholds(metric_series: pd.Series, 
                               ewma_alpha: float = 0.15, 
                               mad_multiplier: float = 2.5) -> dict:
    """
    Computes adaptive thresholds for query regression detection.
    metric_series: Time-indexed series of execution metrics (e.g., p95_duration_ms)
    """
    # 1. EWMA Baseline (smooths gradual data growth)
    ewma_baseline = metric_series.ewm(alpha=ewma_alpha, adjust=False).mean()
    
    # 2. Rolling MAD (robust to outliers, unlike standard deviation)
    rolling_mad = metric_series.rolling(window=1008).apply(median_abs_deviation, raw=True)
    
    # 3. Dynamic Upper Bound (adaptive ceiling)
    upper_bound = ewma_baseline + (mad_multiplier * rolling_mad)
    
    # 4. Hysteresis buffer prevents alert flapping during recovery
    lower_bound = ewma_baseline - (0.5 * rolling_mad)
    
    return {
        "baseline": ewma_baseline.iloc[-1],
        "upper_threshold": upper_bound.iloc[-1],
        "lower_threshold": lower_bound.iloc[-1],
        "current_variance_pct": ((metric_series.iloc[-1] - ewma_baseline.iloc[-1]) / ewma_baseline.iloc[-1]) * 100
    }

Deploy this engine as a sidecar or scheduled Lambda/Cloud Function that ingests metrics from your time-series database. Persist computed thresholds in a version-controlled configuration store to enable auditability and rollback.

Deterministic Root Cause Analysis

When dynamic thresholds trigger, execute the following triage sequence to isolate degradation vectors. Do not proceed to remediation until the root cause is classified.

Step 1: Verify Plan Fingerprint

Compare the current plan_hash against the registry baseline. A mismatch confirms optimizer re-evaluation (join order change, scan type flip, or parallelism adjustment).

Step 2: Validate Statistics Freshness

Stale column statistics on high-cardinality fields frequently force nested-loop joins over hash joins.

PostgreSQL:

SQL
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'target_table'
ORDER BY n_mod_since_analyze DESC;

SQL Server:

SQL
SELECT name, last_updated, modification_counter, rows
FROM sys.stats s
JOIN sys.dm_db_stats_properties(s.object_id, s.stats_id) sp ON s.object_id = sp.object_id
WHERE object_id = OBJECT_ID('dbo.target_table');

Mitigation: Trigger ANALYZE or UPDATE STATISTICS with FULLSCAN if n_mod_since_analyze > 10% of total rows.

Step 3: Evaluate Parameter Sniffing

Compare cached plan parameters against current execution parameters. A >20% deviation in filtered column selectivity invalidates cardinality estimates.

Mitigation: Implement OPTION (RECOMPILE) (SQL Server) or plan_cache_mode = force_generic_plan (PostgreSQL) for highly skewed parameter sets. Alternatively, use query store plan forcing.

Step 4: Correlate Infrastructure Contention

Cross-reference CPU_ready, I/O latency, and memory pressure with the execution window. Use Prometheus Recording Rules to pre-aggregate node-level resource saturation. If infrastructure metrics breach 85% utilization, classify the regression as environmental, not query-level. Route to platform capacity alerts instead of query optimization queues.

Observability Integration & Alert Routing

Wire the threshold engine into your alerting pipeline using multi-signal correlation and hysteresis to prevent alert fatigue.

Prometheus Alertmanager Configuration:

YAML
groups:
  - name: query_regression_dynamic
    rules:
      - record: job:query_p95_duration:ewma
        expr: ewma_over_time(query_p95_duration_ms[7d], 0.15)
      - alert: QueryPlanRegressionDetected
        expr: |
          (query_logical_reads > job:query_logical_reads:ewma * 1.35)
          and
          (query_p95_duration_ms > job:query_p95_duration:ewma * 1.35)
          and
          (count_over_time(query_plan_hash_changes[15m]) > 0)
        for: 5m
        labels:
          severity: critical
          team: db-sre
        annotations:
          summary: "Dynamic threshold breach: "
          description: "Plan regression detected. Execute RCA runbook. Baseline: ms"

Route alerts to PagerDuty/Opsgenie with a 15-minute cooldown and require explicit acknowledgment before auto-escalation. Tag alerts with query_id, plan_hash, and threshold_type to enable automated runbook attachment.

Safe Fallback Chains & CI/CD Automation

Never allow automated threshold detection to trigger untested schema changes or aggressive query rewrites. Implement a tiered fallback chain that prioritizes plan stability over aggressive optimization.

Fallback Hierarchy

  1. Plan Guide / SPM Forcing: Lock the known-good execution plan using SQL Server Query Store or PostgreSQL pg_hint_plan.
  2. Parameterized Query Rewrite: Force generic plans or add OPTIMIZE FOR UNKNOWN to neutralize sniffing.
  3. Targeted Index Maintenance: Rebuild fragmented indexes (>30% fragmentation) or add covering indexes for high rows_examined queries.
  4. Compute Scale-Out: Last resort. Add read replicas or increase memory allocation only after plan-level mitigations fail.

CI/CD Gate Integration

Integrate threshold validation into your deployment pipeline. Block merges that introduce plan regressions beyond 1.2x baseline in staging.

PYTHON
# ci_regression_gate.py
import requests
import sys

def validate_plan_regression(query_id: str, threshold_url: str) -> bool:
    """Blocks CI/CD if staging execution exceeds dynamic thresholds."""
    resp = requests.get(f"{threshold_url}/api/v1/thresholds/{query_id}")
    data = resp.json()
    
    if data["current_variance_pct"] > 20.0:
        print(f"BLOCKED: Query {query_id} exceeds dynamic threshold by {data['current_variance_pct']:.1f}%")
        print(f"Action: Attach plan guide or optimize join predicates before merge.")
        return False
    return True

if __name__ == "__main__":
    if not validate_plan_regression(sys.argv[1], sys.argv[2]):
        sys.exit(1)

For production emergencies, maintain a plan_override_registry with time-bound TTLs. Auto-expire forced plans after 72 hours and re-evaluate against fresh statistics. Document every override in the incident post-mortem to refine baseline calculations.