ES|QL query patterns for Kibana alerting v2 rules
Some detection problems can't be expressed as a single metric compared to a fixed threshold. You might need to know whether an SLO is burning through its error budget across multiple time windows at once. Or whether a specific host has gone silent, rather than whether the query returned nothing. Or whether a condition has persisted continuously across consecutive time buckets rather than appearing once. These are structurally different problems that require different query shapes.
Use this page when a basic STATS ... WHERE pattern isn't enough, or when the detection logic itself requires multi-window calculation, last-seen reasoning, or bucket-level persistence checks. If you're still learning how Kibana alerting v2 rules work, start with Author rules first.
A threshold query evaluates one metric over one lookback window and fires if a value crosses a limit. It is the simplest rule shape: a STATS aggregation followed by a WHERE condition.
FROM logs-*
| STATS
// Count only error responses; count all requests for the denominator
error_count = COUNT_IF(http.response.status_code >= 500),
total_count = COUNT(*)
BY service.name
// Cast to double so integer division does not truncate the rate to zero
| EVAL error_rate = TO_DOUBLE(error_count) / total_count
| WHERE error_rate > 0.10
| KEEP service.name, error_rate, error_count, total_count
- Compute the error rate as a fraction (0–1)
- Alert condition: services above 10% error rate are breaches
One window, one aggregate, one threshold check. The result is either a breach or no breach for each group.
An SLO burn rate query asks a different question than a basic threshold: are you consuming your error budget faster than you can afford to? Rather than checking a single metric at a fixed limit, it calculates error rates across multiple time windows simultaneously and returns a severity level.
Checking both a short window (for example, 5 minutes) and a long window (for example, 1 hour) together filters out brief spikes that do not represent a real budget threat. CRITICAL fires only when both the short and long burn rates exceed the threshold. The two-window requirement is what separates a genuine budget emergency from a momentary blip.
A single ES|QL query handles all window pairs at once using conditional aggregation:
FROM metrics-*
| WHERE @timestamp >= NOW() - 6 hours
// Keep this filter in sync with the rule's lookback setting; it must cover the longest window below (6 hours).
| STATS
// CRITICAL window pair: 5 min catches the fast signal, 1 hour confirms it's sustained
errors_5m = COUNT_IF(outcome == "failure" AND @timestamp >= NOW() - 5 minutes),
total_5m = COUNT_IF(@timestamp >= NOW() - 5 minutes),
errors_1h = COUNT_IF(outcome == "failure" AND @timestamp >= NOW() - 1 hour),
total_1h = COUNT_IF(@timestamp >= NOW() - 1 hour),
// HIGH window pair: 30 min fast signal, 6 hours sustained confirmation
errors_30m = COUNT_IF(outcome == "failure" AND @timestamp >= NOW() - 30 minutes),
total_30m = COUNT_IF(@timestamp >= NOW() - 30 minutes),
errors_6h = COUNT_IF(outcome == "failure" AND @timestamp >= NOW() - 6 hours),
total_6h = COUNT_IF(@timestamp >= NOW() - 6 hours)
BY slo.id
| EVAL
// Error budget = 1 - SLO target; 0.001 assumes a 99.9% objective, so adjust to your SLO
error_budget = 0.001
| EVAL
// Burn rate = error rate divided by the error budget for each window.
// Cast to double so integer division does not truncate the fraction to zero.
burn_5m = (TO_DOUBLE(errors_5m) / total_5m) / error_budget,
burn_1h = (TO_DOUBLE(errors_1h) / total_1h) / error_budget,
burn_30m = (TO_DOUBLE(errors_30m) / total_30m) / error_budget,
burn_6h = (TO_DOUBLE(errors_6h) / total_6h) / error_budget
| EVAL severity = CASE(
// CRITICAL: both the fast and sustained windows burn error budget at more than 14.4x the sustainable rate.
// Requiring both prevents a single brief spike from triggering a critical alert.
burn_5m > 14.4 AND burn_1h > 14.4, "CRITICAL",
// HIGH: same two-window logic at a lower threshold
burn_30m > 6.0 AND burn_6h > 6.0, "HIGH",
"none"
)
| WHERE severity != "none"
| KEEP slo.id, severity, burn_5m, burn_1h, burn_30m, burn_6h
- Lookback must cover the longest window pair used in the query.
- Each SLO is evaluated independently
- Only breaching SLOs become alert rows
- Store fields needed for routing and triage
The burn rate multipliers (14.4×, 6×) reflect standard SLO error budget consumption rates. Adjust them to match your SLO targets.
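As a sanity check on where those numbers come from: over a 30-day SLO window (720 hours), a sustained burn rate of 14.4 exhausts the entire error budget in 720 / 14.4 = 50 hours, which is the conventional threshold for spending about 2% of the budget in one hour (0.02 × 720 ≈ 14.4). The 6× multiplier corresponds to spending about 5% of the budget in six hours (0.05 × 720 / 6 = 6).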
Because the query computes several window pairs in one pass, the lookback window on the rule must cover the longest window in the query (6 hours in the example above).
No-data detection inverts the normal pattern. Instead of filtering for data that meets a condition, you query for when data was last seen and flag sources that have gone silent.
The technique uses a broad lookback to find all known hosts, then surfaces only those that have not reported recently:
FROM metrics-*
| WHERE @timestamp >= NOW() - 12 hours
// Broad lookback: every known host should have at least one event in this window under normal conditions
| STATS last_seen = MAX(@timestamp) BY host.name
| WHERE last_seen < NOW() - 15 minutes
| KEEP host.name, last_seen
- Broad lookback: must be wide enough that all known hosts have at least one event in the window under normal conditions
- Find the most recent event timestamp per host
- Keep only hosts that have NOT reported in the last 15 minutes
- Each returned row is a silent host — the query result itself is the alert
Every row returned is a host that has gone silent, so the base query itself drives the alert. No separate alert condition is needed.
| Variant | What it detects |
|---|---|
| Host-specific | Each host that stops reporting generates its own alert series (use BY host.name for grouping). |
| Global | No data from any source. Omit the BY clause and check whether the query returns any rows at all. |
| Combined | Flags both a high-metric condition and silent hosts in one query using a CASE expression to assign a status field ("alert", "no data", or "ok"), then filters to the problematic rows (see the sketch after this table). |
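A minimal sketch of the combined variant might look like the following. It assumes a CPU utilization field (system.cpu.total.pct) and a 15-minute reporting expectation; swap in whatever metric, thresholds, and windows fit your data stream.
FROM metrics-*
| WHERE @timestamp >= NOW() - 12 hours
| STATS
// Average CPU over the last 15 minutes only; stays null while the host is silent
last_seen = MAX(@timestamp),
cpu_15m = AVG(CASE(@timestamp >= NOW() - 15 minutes, system.cpu.total.pct, null))
BY host.name
| EVAL status = CASE(
last_seen < NOW() - 15 minutes, "no data",
cpu_15m > 0.90, "alert",
"ok"
)
| WHERE status != "ok"
| KEEP host.name, status, last_seen, cpu_15m
Silent hosts surface as "no data" rows and overloaded ones as "alert" rows, so every row the query returns is still an actionable condition.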
The lookback must be wide enough that known hosts appear in the result set. If the lookback is too short, a silent host falls outside the window and is never checked. However, large lookback windows on high-frequency data streams increase query cost significantly. Start with a lookback that comfortably covers the longest expected reporting gap for your hosts, not the full history of the index.
For no-data behavior when the entire base query returns zero rows (as opposed to detecting specific silent sources), refer to No-data handling.
Some patterns from the classic alerting aggregation API are not directly available in ES|QL, and some require workarounds.
A persistent breach condition detects a metric that stays above a threshold across several consecutive time buckets (for example, "CPU above 90% in all 10 of the last 10 five-minute windows"). ES|QL can express this with bucket counting:
FROM metrics-*
| WHERE @timestamp >= NOW() - 50 minutes
| EVAL bucket = BUCKET(@timestamp, 5 minutes)
| STATS
total_buckets = COUNT_DISTINCT(bucket),
exceeding_buckets = COUNT_DISTINCT(
CASE(system.cpu.total.pct > 0.90, bucket, null)
)
BY host.name
| WHERE total_buckets >= 10
AND exceeding_buckets == total_buckets
// every bucket in the window must have breached
| KEEP host.name, total_buckets, exceeding_buckets
- Lookback must cover all 10 buckets (10 × 5 min = 50 min)
- Assign each event to its 5-minute time bucket
- How many distinct buckets exist in the window
- Count only buckets where CPU exceeded the threshold; null values are excluded by COUNT_DISTINCT
- Require a full window of data before firing, which guards against gaps making a partial breach look persistent
The rule's lookback window must cover all the buckets you want to check (50 minutes for 10 five-minute buckets in this example). If any bucket is missing from the data because the host stopped reporting briefly mid-window, total_buckets drops below 10 and the condition does not fire. Design the query so that gaps in reporting produce the behavior you want: either treating partial coverage as a non-breach or adjusting the WHERE filter to allow it.
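If you would rather tolerate a brief gap than suppress the alert, one option is to relax the final filter, for example allowing one missing bucket out of ten while still requiring every observed bucket to breach:
| WHERE total_buckets >= 9
AND exceeding_buckets == total_buckets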
ES|QL does not have a DERIVATIVE function. In the Elasticsearch aggregations API, a derivative pipeline aggregation calculates the rate of change between consecutive time buckets (for example, "how fast is this counter increasing per minute?"). There is no equivalent in ES|QL today.
Use cases that require true per-bucket deltas (such as detecting a sudden acceleration in error rate) cannot be expressed as an ES|QL rule at this time. Consider pre-computing deltas in an ingest pipeline or using a transform to write derived metrics to a separate index that your rule can then query with a standard threshold pattern.
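If you take the transform or ingest pipeline route, the rule itself collapses back into the threshold shape from the top of this page. A sketch, assuming the pipeline writes a per-minute delta into a derived index (metrics-derived-* and error_rate_delta_per_min are placeholder names):
FROM metrics-derived-*
| WHERE @timestamp >= NOW() - 10 minutes
| STATS max_delta = MAX(error_rate_delta_per_min) BY service.name
| WHERE max_delta > 0.05
| KEEP service.name, max_delta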