Guide

Normalizing Query Plans for Cross-Engine Comparison

In heterogeneous database environments, performance regression detection requires a unified representation of execution strategies. Normalizing Query Plans for Cross-Engine Comparison serves as the deterministic transformation layer within the broader Automated EXPLAIN Capture & Storage Workflows pipeline. This stage isolates raw, engine-specific EXPLAIN output and converts it into a canonical intermediate representation (IR) suitable for baseline tracking, structural diffing, and automated alerting. The normalization boundary is strictly enforced: it does not capture raw SQL, evaluate runtime execution metrics, or trigger alerting. Its sole responsibility is structural and semantic translation from vendor-specific formats into a stable, comparable execution graph.

flowchart LR
  PG["PostgreSQL EXPLAIN JSON"] --> NORM["Normalizer"]
  MY["MySQL EXPLAIN JSON"] --> NORM
  SF["Snowflake or distributed SQL"] --> NORM
  NORM --> IR["Canonical IR — execution graph"]
  IR --> BL["Baseline tracking and structural diff"]

Stage Isolation and Input/Output Contracts

The normalization stage operates exclusively downstream of ingestion and upstream of regression analysis. Strict contract enforcement prevents parser drift from corrupting baseline registries and ensures normalization failures never cascade into false-positive regression alerts.

Input Contract:

  • Pre-routed, engine-tagged payloads containing engine_type, engine_version, schema_context, query_hash, and raw plan output (JSON, text, or XML).
  • Payloads must be delivered via structured log streams or message queues. When Routing EXPLAIN ANALYZE Output to Centralized Logs is active, the normalization worker subscribes to partitioned topics keyed by engine_type. For high-throughput environments, Building Async Ingestion Pipelines for High-Throughput Queries guarantees backpressure management and exactly-once delivery semantics before workers consume payloads.

Output Contract:

  • Versioned, schema-validated canonical IR conforming to a published JSON Schema.
  • Deterministic SHA-256 fingerprint (plan_fingerprint) computed over the serialized IR.
  • Explicit exclusion of runtime timing, I/O wait, or buffer hit metrics. These are deferred to downstream telemetry correlation services.

Idempotency is non-negotiable: identical raw inputs must yield byte-identical normalized outputs regardless of worker instance, processing order, or deployment timestamp.

Canonical Intermediate Representation Design

Cross-engine comparison fails when operator nomenclature, cost models, or tree traversal orders diverge. The canonical IR resolves this by mapping vendor-specific constructs to a controlled vocabulary and enforcing deterministic tree serialization.

Core Node Schema:

JSON
{
  "node_id": "uuid-v4",
  "op_type": "ENUM(FULL_TABLE_SCAN, INDEX_SCAN, HASH_JOIN, MERGE_JOIN, SORT, AGGREGATE, FILTER, LIMIT, SUBQUERY)",
  "relation": "string (schema.table or alias)",
  "predicate": "string (normalized to sorted conjunctive normal form)",
  "estimated_cost": "float (normalized to 0.0-1.0 scale)",
  "estimated_rows": "int",
  "parallel_workers": "int",
  "children": ["array of child node objects, sorted deterministically"]
}

Normalization Rules:

  1. Operator Mapping: Vendor terms resolve to the controlled vocabulary. PostgreSQL Seq Scan and MySQL ALL both map to FULL_TABLE_SCAN. Oracle TABLE ACCESS BY INDEX ROWID maps to INDEX_SCAN.
  2. Cost Normalization: Absolute cost units are meaningless across engines. Costs are normalized relative to the plan’s total estimated cost: normalized_cost = node_cost / plan_total_cost.
  3. Tree Determinism: Child nodes are sorted lexicographically by relation then op_type before serialization. This guarantees stable hashing even when the underlying optimizer returns non-deterministic child ordering.
  4. Predicate Canonicalization: Filter conditions are parsed into an abstract syntax tree, sorted by column name, and serialized to a canonical string. This prevents WHERE a=1 AND b=2 and WHERE b=2 AND a=1 from generating distinct fingerprints.

Implementation Architecture and Production Code

The normalization service is implemented as a stateless Python worker using pydantic for schema validation and hashlib for deterministic fingerprinting. The architecture follows a strict adapter pattern: each database engine implements a PlanParser interface, which feeds into a unified Normalizer pipeline.

PYTHON
import hashlib
import json
import logging
from enum import Enum
from typing import Any, Dict, List, Optional
from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger(__name__)

class OpType(str, Enum):
    FULL_TABLE_SCAN = "FULL_TABLE_SCAN"
    INDEX_SCAN = "INDEX_SCAN"
    HASH_JOIN = "HASH_JOIN"
    MERGE_JOIN = "MERGE_JOIN"
    SORT = "SORT"
    AGGREGATE = "AGGREGATE"
    FILTER = "FILTER"
    LIMIT = "LIMIT"

class CanonicalNode(BaseModel):
    node_id: str
    op_type: OpType
    relation: str
    predicate: Optional[str] = None
    estimated_cost: float = Field(ge=0.0, le=1.0)
    estimated_rows: int = Field(ge=0)
    parallel_workers: int = Field(ge=0)
    children: List["CanonicalNode"] = []

class NormalizedPlan(BaseModel):
    engine_type: str
    engine_version: str
    schema_context: str
    query_hash: str
    plan_fingerprint: str
    root: CanonicalNode

# Vendor-specific mapping registry (truncated for brevity)
OPERATOR_MAP = {
    "postgresql": {"Seq Scan": OpType.FULL_TABLE_SCAN, "Index Scan": OpType.INDEX_SCAN},
    "mysql": {"ALL": OpType.FULL_TABLE_SCAN, "ref": OpType.INDEX_SCAN, "eq_ref": OpType.INDEX_SCAN},
}

def normalize_cost(node_cost: float, total_cost: float) -> float:
    if total_cost == 0:
        return 0.0
    return round(node_cost / total_cost, 6)

def compute_fingerprint(plan_dict: Dict[str, Any]) -> str:
    serialized = json.dumps(plan_dict, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

def normalize_payload(raw_payload: Dict[str, Any], engine_type: str, total_cost: float) -> NormalizedPlan:
    try:
        # 1. Parse & map operators (adapter logic omitted for brevity)
        # 2. Recursively build CanonicalNode tree
        # 3. Sort children deterministically
        # 4. Normalize costs
        # 5. Validate & hash
        root_node = CanonicalNode(**_build_canonical_tree(raw_payload["plan"], total_cost))
        plan_dict = {"engine": engine_type, "root": root_node.model_dump(mode="json")}
        fingerprint = compute_fingerprint(plan_dict)
        
        return NormalizedPlan(
            engine_type=engine_type,
            engine_version=raw_payload["engine_version"],
            schema_context=raw_payload["schema_context"],
            query_hash=raw_payload["query_hash"],
            plan_fingerprint=fingerprint,
            root=root_node
        )
    except ValidationError as e:
        logger.error("Schema validation failed during normalization", extra={"error": str(e)})
        raise

Execution Steps:

  1. Ingest & Validate: Deserialize incoming payload, verify required metadata fields.
  2. Engine Adapter Dispatch: Route to the appropriate parser based on engine_type.
  3. Recursive Normalization: Traverse the raw plan tree, map operators, normalize costs, canonicalize predicates.
  4. Deterministic Serialization: Sort children, compute SHA-256 fingerprint.
  5. Schema Validation: Enforce strict JSON Schema compliance before emitting to the output topic.

Observability Hooks and Safe Fallback Protocols

Normalization failures must be contained, observable, and recoverable without blocking the ingestion pipeline.

Observability Stack:

  • Metrics: plan_normalization_success_total, plan_normalization_failure_total, normalization_latency_ms, dlq_depth, circuit_breaker_state.
  • Tracing: OpenTelemetry spans tagged with engine_type, plan_size_bytes, validation_errors. Span context propagates to downstream regression services.
  • Logging: Structured JSON logs at WARN for recoverable schema drift, ERROR for fatal parse failures.

Routing Thresholds & Circuit Breakers:

  • Failure Rate Threshold: If plan_normalization_failure_total / plan_normalization_success_total > 0.05 over a 60-second sliding window, the circuit breaker opens.
  • Circuit Breaker Behavior: On open state, payloads bypass normalization and route directly to a Dead Letter Queue (DLQ) with raw_payload, error_context, and timestamp. An automated alert fires to the SRE on-call channel.
  • Backoff & Recovery: The breaker remains open for 300 seconds, then transitions to half-open. If the next 50 payloads normalize successfully, it closes. Otherwise, it reopens and escalates to P1.

Safe Fallback Protocols:

  1. Partial Normalization: If a node contains unparseable predicates or unsupported operators, the worker strips the offending field, logs a WARN, and continues. The plan_fingerprint is still computed over the remaining valid structure.
  2. Schema Versioning: The IR schema is versioned (v1, v2). Workers reject payloads targeting deprecated schemas but store them in a legacy_dlq for batch migration.
  3. Manual Triage Pipeline: DLQ consumers run a reconciliation job that attempts normalization with updated mapping tables. Successful retries are injected into the main stream with reprocessed: true metadata.

Downstream Routing and Diff Thresholds

Once normalized, the IR is published to a Kafka topic partitioned by query_hash. Downstream regression services consume the stream, compute structural diffs against baseline fingerprints, and evaluate performance thresholds.

Routing Logic:

  • If plan_fingerprint matches the baseline registry within the last 30 days, the payload is archived as a STABLE execution path.
  • If the fingerprint diverges, the payload triggers a structural diff engine. The diff computes an AST edit distance. If the distance exceeds 0.15 (15% structural change), the payload routes to the REGRESSION_ALERT queue.
  • Parameterized queries require additional normalization before plan comparison. See Normalizing Parameterized Queries for Consistent Plan Tracking for the literal-stripping and bind-variable standardization workflow that precedes this stage.

This deterministic normalization boundary ensures that cross-engine performance tracking relies on structural execution semantics rather than vendor-specific formatting artifacts. By enforcing strict contracts, deterministic hashing, and resilient fallback routing, platform teams can scale regression detection across heterogeneous database fleets without introducing false positives or baseline corruption.