ES|QL query patterns for Kibana alerting v2 rules
Some detection problems can't be expressed as a single metric compared to a fixed threshold. You might need to know whether an SLO is burning through its error budget across multiple time windows at once. Or whether a specific host has gone silent, rather than whether the query returned nothing. Or whether a condition has persisted continuously across consecutive time buckets rather than appearing once. These are structurally different problems that require different query shapes.
Use this page when a basic STATS ... WHERE pattern isn't enough, or when the detection logic itself requires multi-window calculation, last-seen reasoning, or bucket-level persistence checks. If you're still learning how Kibana alerting v2 rules work, start with Author rules first.
A threshold query evaluates one metric over one lookback window and fires if a value crosses a limit. It is the simplest rule shape: a STATS aggregation followed by a WHERE condition.
FROM logs-*
| STATS
// Count only error responses; count all requests for the denominator
error_count = COUNT_IF(http.response.status_code >= 500),
total_count = COUNT(*)
BY service.name
// Cast to double so integer division does not truncate the rate to zero
| EVAL error_rate = TO_DOUBLE(error_count) / total_count
| WHERE error_rate > 0.10
| KEEP service.name, error_rate, error_count, total_count
- Compute the error rate as a fraction (0–1)
- Alert condition: services above 10% error rate are breaches
One window, one aggregate, one threshold check. The result is either a breach or no breach for each group.
An SLO burn rate query asks a different question than a basic threshold: are you consuming your error budget faster than you can afford to? Rather than checking a single metric at a fixed limit, it calculates error rates across multiple time windows simultaneously and returns a severity level.
Checking both a short window (for example, 5 minutes) and a long window (for example, 1 hour) together filters out brief spikes that do not represent a real budget threat. CRITICAL fires only when both the short and long burn rates exceed the threshold. The two-window requirement is what separates a genuine budget emergency from a momentary blip.
A single ES|QL query handles all window pairs at once using conditional aggregation:
FROM metrics-*
| WHERE @timestamp >= NOW() - 6 hours
// Keep this filter in sync with the rule's lookback setting; it must cover the longest window below (6 hours).
| STATS
// CRITICAL window pair: 5 min catches the fast signal, 1 hour confirms it's sustained
errors_5m = COUNT_IF(outcome == "failure" AND @timestamp >= NOW() - 5 minutes),
total_5m = COUNT_IF(@timestamp >= NOW() - 5 minutes),
errors_1h = COUNT_IF(outcome == "failure" AND @timestamp >= NOW() - 1 hour),
total_1h = COUNT_IF(@timestamp >= NOW() - 1 hour),
// HIGH window pair: 30 min fast signal, 6 hours sustained confirmation
errors_30m = COUNT_IF(outcome == "failure" AND @timestamp >= NOW() - 30 minutes),
total_30m = COUNT_IF(@timestamp >= NOW() - 30 minutes),
errors_6h = COUNT_IF(outcome == "failure" AND @timestamp >= NOW() - 6 hours),
total_6h = COUNT_IF(@timestamp >= NOW() - 6 hours)
BY slo.id
| EVAL
// Error budget = 1 - SLO target; 0.001 assumes a 99.9% objective, so adjust to your SLO
error_budget = 0.001
| EVAL
// Burn rate = error rate divided by the error budget for each window.
// Cast to double so integer division does not truncate the fraction to zero.
burn_5m = (TO_DOUBLE(errors_5m) / total_5m) / error_budget,
burn_1h = (TO_DOUBLE(errors_1h) / total_1h) / error_budget,
burn_30m = (TO_DOUBLE(errors_30m) / total_30m) / error_budget,
burn_6h = (TO_DOUBLE(errors_6h) / total_6h) / error_budget
| EVAL severity = CASE(
// CRITICAL: both the fast and sustained windows burn error budget at more than 14.4x the sustainable rate.
// Requiring both prevents a single brief spike from triggering a critical alert.
burn_5m > 14.4 AND burn_1h > 14.4, "CRITICAL",
// HIGH: same two-window logic at a lower threshold
burn_30m > 6.0 AND burn_6h > 6.0, "HIGH",
"none"
)
| WHERE severity != "none"
| KEEP slo.id, severity, burn_5m, burn_1h, burn_30m, burn_6h
- Lookback must cover the longest window pair used in the query.
- Each SLO is evaluated independently
- Only breaching SLOs become alert rows
- Store fields needed for routing and triage
The burn rate multipliers (14.4×, 6×) reflect standard SLO error budget consumption rates. Adjust them to match your SLO targets.
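As a sanity check on where those numbers come from: over a 30-day SLO window (720 hours), a sustained burn rate of 14.4 exhausts the entire error budget in 720 / 14.4 = 50 hours, which is the conventional threshold for spending about 2% of the budget in one hour (0.02 × 720 ≈ 14.4). The 6× multiplier corresponds to spending about 5% of the budget in six hours (0.05 × 720 / 6 = 6).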
Because the query computes several window pairs in one pass, the lookback window on the rule must cover the longest window in the query (6 hours in the example above).
No-data detection inverts the normal pattern. Instead of filtering for data that meets a condition, you query for when data was last seen and flag sources that have gone silent.
The technique uses a broad lookback to find all known hosts, then surfaces only those that have not reported recently:
FROM metrics-*
| WHERE @timestamp >= NOW() - 12 hours
// Broad lookback: every known host should have at least one event in this window under normal conditions
| STATS last_seen = MAX(@timestamp) BY host.name
| WHERE last_seen < NOW() - 15 minutes
| KEEP host.name, last_seen
- Broad lookback: must be wide enough that all known hosts have at least one event in the window under normal conditions
- Find the most recent event timestamp per host
- Keep only hosts that have NOT reported in the last 15 minutes
- Each returned row is a silent host — the query result itself is the alert
Every row returned is a host that has gone silent, so the base query itself drives the alert. No separate alert condition is needed.
| Variant | What it detects |
|---|---|
| Host-specific | Each host that stops reporting generates its own alert series (use BY host.name for grouping). |
| Global | No data from any source. Omit the BY clause and check whether the query returns any rows at all. |
| Combined | Flags both a high-metric condition and silent hosts in one query using a CASE expression to assign a status field ("alert", "no data", or "ok"), then filters to the problematic rows (see the sketch after this table). |
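A minimal sketch of the combined variant might look like the following. It assumes a CPU utilization field (system.cpu.total.pct) and a 15-minute reporting expectation; swap in whatever metric, thresholds, and windows fit your data stream.
FROM metrics-*
| WHERE @timestamp >= NOW() - 12 hours
| STATS
// Average CPU over the last 15 minutes only; stays null while the host is silent
last_seen = MAX(@timestamp),
cpu_15m = AVG(CASE(@timestamp >= NOW() - 15 minutes, system.cpu.total.pct, null))
BY host.name
| EVAL status = CASE(
last_seen < NOW() - 15 minutes, "no data",
cpu_15m > 0.90, "alert",
"ok"
)
| WHERE status != "ok"
| KEEP host.name, status, last_seen, cpu_15m
Silent hosts surface as "no data" rows and overloaded ones as "alert" rows, so every row the query returns is still an actionable condition.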
The lookback must be wide enough that known hosts appear in the result set. If the lookback is too short, a silent host falls outside the window and is never checked. However, large lookback windows on high-frequency data streams increase query cost significantly. Start with a lookback that comfortably covers the longest expected reporting gap for your hosts, not the full history of the index.
For no-data behavior when the entire base query returns zero rows (as opposed to detecting specific silent sources), refer to No-data handling.
Some patterns from the classic alerting aggregation API are not directly available in ES|QL, and some require workarounds.
A persistent breach condition detects a metric that stays above a threshold across several consecutive time buckets (for example, "CPU above 90% in all 10 of the last 10 five-minute windows"). ES|QL can express this with bucket counting:
FROM metrics-*
| WHERE @timestamp >= NOW() - 50 minutes
| EVAL bucket = BUCKET(@timestamp, 5 minutes)
| STATS
total_buckets = COUNT_DISTINCT(bucket),
exceeding_buckets = COUNT_DISTINCT(
CASE(system.cpu.total.pct > 0.90, bucket, null)
)
BY host.name
| WHERE total_buckets >= 10
AND exceeding_buckets == total_buckets
// every bucket in the window must have breached
| KEEP host.name, total_buckets, exceeding_buckets
- Lookback must cover all 10 buckets (10 × 5 min = 50 min)
- Assign each event to its 5-minute time bucket
- How many distinct buckets exist in the window
- Count only buckets where CPU exceeded the threshold; null values are excluded by COUNT_DISTINCT
- Require a full window of data before firing, which guards against gaps making a partial breach look persistent
The rule's lookback window must cover all the buckets you want to check (50 minutes for 10 five-minute buckets in this example). If any bucket is missing from the data because the host stopped reporting briefly mid-window, total_buckets drops below 10 and the condition does not fire. Design the query so that gaps in reporting produce the behavior you want: either treating partial coverage as a non-breach or adjusting the WHERE filter to allow it.
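If you would rather tolerate a brief gap than suppress the alert, one option is to relax the final filter, for example allowing one missing bucket out of ten while still requiring every observed bucket to breach:
| WHERE total_buckets >= 9
AND exceeding_buckets == total_buckets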
ES|QL does not have a DERIVATIVE function. In the Elasticsearch aggregations API, a derivative pipeline aggregation calculates the rate of change between consecutive time buckets (for example, "how fast is this counter increasing per minute?"). There is no equivalent in ES|QL today.
Use cases that require true per-bucket deltas (such as detecting a sudden acceleration in error rate) cannot be expressed as an ES|QL rule at this time. Consider pre-computing deltas in an ingest pipeline or using a transform to write derived metrics to a separate index that your rule can then query with a standard threshold pattern.
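If you take the transform or ingest pipeline route, the rule itself collapses back into the threshold shape from the top of this page. A sketch, assuming the pipeline writes a per-minute delta into a derived index (metrics-derived-* and error_rate_delta_per_min are placeholder names):
FROM metrics-derived-*
| WHERE @timestamp >= NOW() - 10 minutes
| STATS max_delta = MAX(error_rate_delta_per_min) BY service.name
| WHERE max_delta > 0.05
| KEEP service.name, max_delta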