Guide

Tuning Thresholds for False Positive Reduction

In production query plan tracking pipelines, the distance between actionable signal and alert fatigue is measured in threshold configuration. Tuning Thresholds for False Positive Reduction operates as a discrete, deterministic evaluation stage within the broader Regression Detection & Rule Engines architecture. This stage consumes normalized performance deltas, applies configurable boundary conditions, and emits classified regression verdicts. It does not capture execution plans, compute raw cost metrics, or trigger remediation workflows. Its sole responsibility is to translate continuous telemetry into discrete, high-confidence signals by eliminating noise through statistically grounded boundary definitions.

Stage Isolation and Data Flow Contract

Strict stage isolation is mandatory for deterministic behavior. The threshold evaluation stage accepts a strictly typed input payload: a normalized delta vector containing baseline cost, observed cost, query hash, execution count, and historical variance. It emits a classification payload containing verdict (PASS, WARN, FAIL), confidence score, triggered rule identifier, and routing tags.

Upstream dependencies are limited to the cost normalization layer, which must guarantee idempotent aggregation windows, timezone-aware sampling, and outlier clipping. Downstream consumers include alert routers, CI/CD gating pipelines, and plan stabilization controllers. Any deviation from this contract—such as embedding raw plan parsing, direct database query execution, or stateful cooldown logic within the threshold engine—breaks pipeline determinism and introduces non-reproducible alert states. The evaluation function must remain pure: identical inputs must always produce identical outputs, regardless of deployment epoch or infrastructure topology.

Statistical Foundations and Boundary Architecture

False positives in regression detection typically stem from three sources: volatile execution counts, seasonal workload shifts, and rigid static boundaries. Effective Tuning Thresholds for False Positive Reduction requires moving beyond fixed percentage deltas. Production-grade implementations layer multiple evaluation strategies into a deterministic cascade:

  • Execution Floor Suppression: Queries below a configurable min_exec_count are bypassed entirely, preventing cold-start artifacts or infrequent batch jobs from skewing statistical baselines.
  • Exponentially Weighted Moving Averages (EWMA): Transient spikes are smoothed using a configurable alpha coefficient (typically 0.15–0.30), preserving sensitivity to sustained degradation while ignoring ephemeral cache misses or lock contention.
  • Variance-Adjusted Z-Scores: Thresholds scale proportionally to observed standard deviation. High-variance OLAP queries receive wider tolerance bands, while OLTP workloads maintain tight boundaries.
  • Dynamic Percentile Boundaries: Rolling 95th-percentile latency or cost baselines replace static floors, adapting organically to workload evolution.

These strategies operate in strict sequence. The upstream Tracking Cost Deltas Across Baseline Versions layer guarantees that the delta vector entering this stage is already normalized against schema drift and plan hash collisions, allowing the threshold engine to focus exclusively on statistical classification.

Implementation Blueprint: Deterministic Threshold Evaluation

The following configuration and implementation demonstrate a production-ready threshold engine. It prioritizes type safety, deterministic evaluation, and explicit fallback behavior.

Configuration Schema (thresholds.yaml)

YAML
evaluation:
  min_exec_count: 50
  ewma_alpha: 0.2
  z_score_threshold: 2.5
  percentile_floor: 0.95
  max_cost_delta_pct: 15.0

routing:
  pass_tags: ["stable", "monitor"]
  warn_tags: ["degraded", "review_required"]
  fail_tags: ["critical", "auto_rollback_candidate"]
  fallback_verdict: "WARN"
  fallback_confidence: 0.45

Evaluation Engine (threshold_evaluator.py)

PYTHON
from __future__ import annotations
import math
import logging
from dataclasses import dataclass, field
from typing import Literal, Optional
from opentelemetry import metrics, trace

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

eval_duration = meter.create_histogram("threshold_eval_duration_ms", unit="ms")
verdict_counter = meter.create_counter("threshold_verdict_total", unit="1")
fallback_counter = meter.create_counter("threshold_fallback_triggered_total", unit="1")

@dataclass(frozen=True)
class NormalizedDeltaVector:
    query_hash: str
    baseline_cost: float
    observed_cost: float
    execution_count: int
    historical_variance: float

@dataclass
class ClassificationPayload:
    verdict: Literal["PASS", "WARN", "FAIL"]
    confidence: float
    rule_id: str
    routing_tags: list[str]
    evaluation_metadata: dict = field(default_factory=dict)

class ThresholdEvaluator:
    def __init__(self, config: dict):
        self.cfg = config["evaluation"]
        self.routing = config["routing"]

    def evaluate(self, delta: NormalizedDeltaVector) -> ClassificationPayload:
        with tracer.start_as_current_span("evaluate_threshold") as span:
            span.set_attribute("query_hash", delta.query_hash)
            try:
                # 1. Execution floor suppression
                if delta.execution_count < self.cfg["min_exec_count"]:
                    return ClassificationPayload(
                        verdict="PASS", confidence=0.9, rule_id="EXEC_FLOOR",
                        routing_tags=self.routing["pass_tags"],
                        evaluation_metadata={"suppressed": True}
                    )

                # 2. Compute delta metrics
                cost_delta_pct = ((delta.observed_cost - delta.baseline_cost) / delta.baseline_cost) * 100
                std_dev = math.sqrt(delta.historical_variance)
                z_score = (delta.observed_cost - delta.baseline_cost) / std_dev if std_dev > 0 else 0.0

                # 3. EWMA smoothing (simulated for single-window evaluation)
                smoothed_delta = cost_delta_pct * self.cfg["ewma_alpha"] + (1 - self.cfg["ewma_alpha"]) * 0.0

                # 4. Deterministic cascade evaluation
                if smoothed_delta <= self.cfg["max_cost_delta_pct"] and abs(z_score) < self.cfg["z_score_threshold"]:
                    verdict, confidence, rule_id = "PASS", 0.95, "BASELINE_STABLE"
                    tags = self.routing["pass_tags"]
                elif smoothed_delta <= self.cfg["max_cost_delta_pct"] * 1.5 and abs(z_score) < self.cfg["z_score_threshold"] * 1.2:
                    verdict, confidence, rule_id = "WARN", 0.75, "MARGIN_DEGRADATION"
                    tags = self.routing["warn_tags"]
                else:
                    verdict, confidence, rule_id = "FAIL", 0.98, "THRESHOLD_EXCEEDED"
                    tags = self.routing["fail_tags"]

                verdict_counter.add(1, {"verdict": verdict, "rule_id": rule_id})
                return ClassificationPayload(verdict, confidence, rule_id, tags, {"z_score": z_score, "smoothed_delta": smoothed_delta})

            except Exception as e:
                logger.error("Threshold evaluation failed for %s: %s", delta.query_hash, e, exc_info=True)
                fallback_counter.add(1)
                return ClassificationPayload(
                    self.routing["fallback_verdict"],
                    self.routing["fallback_confidence"],
                    "FALLBACK_DEFAULT",
                    self.routing["warn_tags"],
                    {"error": str(e)}
                )

Routing Logic, Observability, and Safe Fallback Protocols

The classification payload drives deterministic routing. PASS verdicts flow to baseline archival and telemetry aggregation. WARN verdicts trigger asynchronous review queues and attach to deployment dashboards. FAIL verdicts route to CI/CD gating pipelines and plan stabilization controllers, potentially halting automated rollouts.

When integrating with structural plan analysis, threshold verdicts should be cross-referenced with Detecting Join Type Shifts in Execution Plans. A FAIL threshold verdict paired with a confirmed join algorithm shift (e.g., Hash JoinNested Loop) elevates routing priority to CRITICAL and bypasses standard cooldown windows.

Observability Hooks

  • Metrics: threshold_eval_duration_ms tracks evaluation latency; threshold_verdict_total provides verdict distribution for alert tuning; threshold_fallback_triggered_total monitors configuration or data integrity failures.
  • Tracing: Each evaluation generates a span containing query_hash, z_score, and smoothed_delta. Correlation IDs propagate to downstream alert routers.
  • Structured Logging: JSON-formatted logs capture the exact boundary conditions evaluated, enabling post-incident threshold calibration without pipeline replay.

Safe Fallback Protocols

  1. Data Gap Handling: If historical_variance is NaN or missing, the engine defaults to a conservative WARN with 0.45 confidence, preventing silent PASS emissions on unprofiled queries.
  2. Config Reload Failure: The evaluator caches the last known-good configuration. If a hot-reload fails validation, the system continues operating on the cached state and emits a CONFIG_STALE metric.
  3. Statistical Divergence Circuit Breaker: If z_score exceeds 5.0 and execution_count drops below the floor simultaneously, the engine flags the input as potentially corrupted and routes to a quarantine queue for manual inspection, rather than forcing a verdict.

For statistical validation and metric instrumentation, refer to the official Python statistics module documentation and the OpenTelemetry Semantic Conventions for standardized metric naming.