Experimental alerting features alerts

When a rule fires repeatedly on the same problem, a flat list of events doesn't tell you when the issue started, whether it's still happening, or how long it's been going on. Alert episodes fill that gap. Each episode is a persistent record of one issue on one series, from first breach through recovery, with every evaluation appended to the same history. Nothing is overwritten.

Every alert episode moves through these states:

inactive → pending → active → recovering → inactive

State      | What it means
-----------|--------------
Inactive   | Problem fully resolved. You get a recovery notification.
Pending    | Errors detected, but the system is waiting to confirm it's a real problem before fully alerting.
Active     | Problem confirmed and ongoing. This is when you get notified.
Recovering | Errors have stopped, but the system is waiting to confirm it's truly resolved.

Activation and recovery thresholds control how many consecutive evaluations must agree, or how long the condition must persist, before transitioning.

A checkout-latency rule runs in Alert mode every 5 minutes, with no activation or recovery threshold configured. Latency breaches at 14:05 and clears at 14:50:

  1. 14:00 — Routine check. p95 is within budget. The episode is inactive.
  2. 14:05 — p95 jumps to 3.1s. The first breach is detected. With no activation threshold, the episode opens immediately as active.
  3. 14:10–14:45 — Every evaluation finds high latency. The same episode stays active. No new episodes are created.
  4. 14:50 — p95 drops back under 2s. With no recovery threshold, the episode resolves immediately to inactive.

One problem is tracked in one episode, even though the rule evaluated many times while the condition was ongoing.

The same checkout-latency rule, now with an activation threshold of 2 consecutive breaches and a recovery threshold of 2 consecutive clears:

  1. 14:00 — Routine check. p95 is within budget. The episode is inactive.
  2. 14:05 — p95 jumps to 3.1s. The first breach is detected. The episode is created in pending and the system starts counting consecutive breaches.
  3. 14:10 — p95 is still elevated. The second consecutive breach meets the activation threshold. The episode moves from pending to active, and the engineer is paged.
  4. 14:15–14:45 — Latency stays elevated. The episode remains active.
  5. 14:50 — p95 drops back under 2s. The first clean check moves the episode to recovering. The system starts counting consecutive clears.
  6. 14:55 — A second consecutive clear meets the recovery threshold. The episode moves from recovering to inactive.

Thresholds prevent brief spikes from opening episodes and transient dips from closing them prematurely. The episode waits in pending until the problem is confirmed, and waits in recovering until the resolution is confirmed.
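The two walkthroughs above can be reproduced with a small state machine. This is a minimal sketch, not Kibana's implementation: the `EpisodeTracker` class and its field names are hypothetical, only count-based thresholds are modeled (not duration-based ones), and "no threshold" is represented as a threshold of 1. The handling of a clear while pending and a breach while recovering is an assumption, since the text does not spell out those transitions.

```python
from dataclasses import dataclass


@dataclass
class EpisodeTracker:
    """Toy lifecycle model for one series, using count-based thresholds only."""

    activation_threshold: int = 1  # consecutive breaches needed to reach active
    recovery_threshold: int = 1    # consecutive clears needed to reach inactive
    status: str = "inactive"
    streak: int = 0                # evaluations counted toward the next transition

    def evaluate(self, breached: bool) -> str:
        if self.status in ("inactive", "pending"):
            if breached:
                self.streak += 1
                if self.streak >= self.activation_threshold:
                    self.status, self.streak = "active", 0
                else:
                    self.status = "pending"
            else:
                # Assumption: a clear before activation drops back to inactive.
                self.status, self.streak = "inactive", 0
        else:  # active or recovering
            if breached:
                # Assumption: a breach while recovering returns to active.
                self.status, self.streak = "active", 0
            else:
                self.streak += 1
                if self.streak >= self.recovery_threshold:
                    self.status, self.streak = "inactive", 0
                else:
                    self.status = "recovering"
        return self.status


def run(tracker, breaches):
    """Feed a sequence of evaluation results and collect the resulting states."""
    return [tracker.evaluate(b) for b in breaches]


# Evaluations every 5 minutes from 14:00 to 14:55; breaching from 14:05 through 14:45.
BREACHES = [False] + [True] * 9 + [False] * 2

no_thresholds = run(EpisodeTracker(), BREACHES)
with_thresholds = run(
    EpisodeTracker(activation_threshold=2, recovery_threshold=2), BREACHES
)
print(no_thresholds)
print(with_thresholds)
```

With thresholds of 2, the 14:05 evaluation lands in pending and the 14:50 evaluation lands in recovering, matching the second walkthrough.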

A series is the ongoing relationship between a rule and one specific thing it monitors.

Suppose your rule monitors services. Each service it tracks has its own series: one for checkout-service, one for payment-service, and so on. A series exists for as long as that rule keeps monitoring that service.

Think of it like a patient's medical file. The file exists as long as the patient is in the system. Individual health incidents come and go, but the file persists.

An episode lives inside a series. A series can contain many episodes over its lifetime, one for each time that service had a problem.

Series: checkout-service
│
├── Episode 1: errors on April 10 (active → inactive)
├── Episode 2: errors on April 15 (active → inactive)
└── Episode 3: errors on April 18 (active right now)

The series is the container. Episodes are the individual problems that happened within it. When the series breaches again after recovering, a new episode starts.

This means you can track "the checkout service was broken from 02:14 to 03:21" and "the payment service was broken at the same time" as separate episodes, even when both come from the same rule.
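The relationship between one rule, its series, and their episodes can be sketched as a grouping step. The function below is illustrative only: `assign_episodes`, the tuple layout, and the `start`/`end` keys are hypothetical names, and thresholds are ignored so that a breach opens an episode immediately and a clear closes it.

```python
from collections import defaultdict


def assign_episodes(evaluations):
    """Group per-series evaluation results into episodes.

    evaluations: iterable of (timestamp, series_key, breached), ordered by time.
    Returns {series_key: [{"start": ..., "end": ...}, ...]}.
    """
    episodes = defaultdict(list)
    open_episode = {}  # series_key -> the episode currently open for that series

    for timestamp, series_key, breached in evaluations:
        current = open_episode.get(series_key)
        if breached and current is None:
            # A breach on a series with no open episode starts a new one.
            current = {"start": timestamp, "end": None}
            episodes[series_key].append(current)
            open_episode[series_key] = current
        elif not breached and current is not None:
            # A clear closes the open episode for this series only.
            current["end"] = timestamp
            del open_episode[series_key]

    return dict(episodes)


evaluations = [
    ("02:14", "checkout-service", True),
    ("02:14", "payment-service", True),
    ("03:21", "checkout-service", False),
    ("03:21", "payment-service", False),
    ("05:00", "checkout-service", True),
]
by_series = assign_episodes(evaluations)
print(by_series)
```

Each series keeps its own list, so the later checkout-service breach opens a second episode in that series without touching payment-service.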

Tip

Snooze operates at the series level, not the episode level. If you snooze checkout-service, you silence all notifications from that series for the duration of the snooze, regardless of how many new episodes start during that time. You're quieting a specific ongoing situation, not a single alert.

Concept     | Analogy
------------|--------
Rule        | A security camera watching the building
Series      | The camera's feed for one specific door
Episode     | A specific incident caught on that feed
Rule events | The individual video frames

The camera runs continuously (rule), always watching door 3 (series). One night someone breaks in. That's an episode. The frames captured during the break-in are the rule events.

Every time a rule finds a match, it writes a document to .rule-events. Whether that document is a signal or an alert depends on the rule's mode, and that choice determines whether the system only records what happened or actively tracks it through to resolution.

A signal is a one-time observation. The system writes it and moves on: no lifecycle, no notifications, no follow-up. An alert participates in an episode. The system links it to every other document from the same problem, tracks the lifecycle states, and routes notifications through action policies.

Type   | What it is | When it's created
-------|------------|------------------
Signal | A point-in-time record that the query matched (type: signal). Stored in .rule-events. | Rules in Detect mode
Alert  | A lifecycle-tracked episode with type: alert and episode.* fields. Stored in .rule-events. | Rules in Alert mode

A rule in Detect mode only writes signals. It never opens episodes, so action policies have nothing to match against.
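As a rough illustration of how the mode shapes the document, the sketch below builds a signal or an alert event. Only `type`, `episode.id`, and `episode.status` come from the text above. The function name `build_rule_event`, the `series` key, the mode strings, and the nested document layout are assumptions made for the example, not the actual .rule-events schema.

```python
from datetime import datetime, timezone


def build_rule_event(mode, series_key, episode=None, now=None):
    """Return a dict shaped like a rule event for the given mode (illustrative only)."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    doc = {"@timestamp": timestamp, "series": series_key}  # "series" is a made-up key

    if mode == "detect":
        # Detect mode: record the match and stop. No episode fields are attached.
        doc["type"] = "signal"
    elif mode == "alert":
        if episode is None:
            raise ValueError("alert mode needs an episode to attach the event to")
        # Alert mode: the event carries the episode it belongs to and its current state.
        doc["type"] = "alert"
        doc["episode"] = {"id": episode["id"], "status": episode["status"]}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return doc


signal = build_rule_event("detect", "checkout-service")
alert = build_rule_event(
    "alert", "checkout-service", episode={"id": "ep-1", "status": "pending"}
)
print(signal["type"], alert["type"], alert["episode"]["status"])
```

The signal has nothing for an action policy to match on, while the alert exposes the episode identity and state that downstream routing would need.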

Alert events are stored in .rule-events. Triage actions (acknowledge, snooze, resolve) are stored in .alert-actions. Both are queryable in Discover.

Every time you take an action on an episode (acknowledging it, snoozing it, resolving it, or editing its tags), Kibana writes a new document to .alert-actions. These documents are append-only and can be queried in Discover for auditing and for metrics such as mean time to acknowledge (MTTA).
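To show how MTTA could be derived from such action documents, here is a minimal sketch that works on already-fetched data. The keys `action_type` and `episode_id` are hypothetical placeholders, since the text does not list the field names in .alert-actions, and the activation times are assumed to have been looked up separately from .rule-events.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_acknowledge(activated_at, actions):
    """Average seconds between an episode becoming active and its first acknowledge.

    activated_at: {episode_id: datetime the episode became active}
    actions: iterable of dicts with hypothetical keys
             "episode_id", "action_type", and "@timestamp" (a datetime).
    """
    first_ack = {}
    for action in actions:
        if action["action_type"] != "acknowledge":
            continue
        episode_id, ts = action["episode_id"], action["@timestamp"]
        # Keep only the earliest acknowledge per episode.
        if episode_id not in first_ack or ts < first_ack[episode_id]:
            first_ack[episode_id] = ts

    deltas = [
        (first_ack[eid] - activated_at[eid]).total_seconds()
        for eid in first_ack
        if eid in activated_at
    ]
    return mean(deltas) if deltas else None


activated_at = {
    "ep-1": datetime(2024, 4, 10, 14, 10),
    "ep-2": datetime(2024, 4, 15, 9, 0),
}
actions = [
    {"episode_id": "ep-1", "action_type": "acknowledge", "@timestamp": datetime(2024, 4, 10, 14, 14)},
    {"episode_id": "ep-1", "action_type": "snooze", "@timestamp": datetime(2024, 4, 10, 14, 20)},
    {"episode_id": "ep-2", "action_type": "acknowledge", "@timestamp": datetime(2024, 4, 15, 9, 10)},
    {"episode_id": "ep-2", "action_type": "acknowledge", "@timestamp": datetime(2024, 4, 15, 9, 20)},
]
mtta_seconds = mean_time_to_acknowledge(activated_at, actions)
print(mtta_seconds)
```

Because the action documents are append-only, repeated acknowledgements of the same episode all remain in the stream, which is why the sketch keeps only the earliest one.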

Both .rule-events and .alert-actions are data streams: append-only, time-series stores optimized for writes. On every rule evaluation, Kibana writes a new document to .rule-events rather than updating the previous one. Each document is a point-in-time snapshot. The episode.status field records the lifecycle state the episode was in at that exact evaluation. Nothing is overwritten.

Because every evaluation produces its own document, you can reconstruct the full history of an episode by querying all documents that share the same episode.id.
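One way to picture that reconstruction is a query filtered on episode.id and sorted by time, followed by a pass that collapses the snapshots into state transitions. The sketch below only builds a search request body and processes sample documents locally. It assumes episode.id supports exact-match term queries and that returned documents nest the episode fields under an `episode` object. The helper names are hypothetical.

```python
def episode_history_query(episode_id, size=1000):
    """Build a search body that returns one episode's snapshots in time order."""
    return {
        "size": size,
        "query": {"term": {"episode.id": episode_id}},
        "sort": [{"@timestamp": {"order": "asc"}}],
    }


def status_timeline(docs):
    """Collapse per-evaluation snapshots into (timestamp, status) transitions."""
    timeline = []
    for doc in sorted(docs, key=lambda d: d["@timestamp"]):
        status = doc["episode"]["status"]
        # Record a point only when the status differs from the previous snapshot.
        if not timeline or timeline[-1][1] != status:
            timeline.append((doc["@timestamp"], status))
    return timeline


# Sample snapshots shaped like the thresholded walkthrough (timestamps are illustrative).
docs = [
    {"@timestamp": "2024-04-18T14:10:00Z", "episode": {"id": "ep-3", "status": "active"}},
    {"@timestamp": "2024-04-18T14:05:00Z", "episode": {"id": "ep-3", "status": "pending"}},
    {"@timestamp": "2024-04-18T14:15:00Z", "episode": {"id": "ep-3", "status": "active"}},
    {"@timestamp": "2024-04-18T14:50:00Z", "episode": {"id": "ep-3", "status": "recovering"}},
    {"@timestamp": "2024-04-18T14:55:00Z", "episode": {"id": "ep-3", "status": "inactive"}},
]
query = episode_history_query("ep-3")
timeline = status_timeline(docs)
print(timeline)
```

The body returned by `episode_history_query` could be sent with whichever Elasticsearch client you use. The timeline step then shows when each state began without needing any document to have been updated in place.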

Retention is managed automatically through index lifecycle management (ILM). Older backing indices move through storage tiers and are deleted when the retention window expires, so you do not need to remove documents manually. Kibana manages versioning, retention, and lifecycle for both streams. Do not change their mappings or index settings.