Loading

ES|QL STATS command

The STATS processing command groups rows according to a common value and calculates one or more aggregated values over the grouped rows.

Note

For fast approximate results with confidence intervals, see Approximate STATS queries.

STATS [column1 =] expression1 [WHERE boolean_expression1][,
      ...,
      [columnN =] expressionN [WHERE boolean_expressionN]]
      [BY grouping_expression1[, ..., grouping_expressionN]]
		
columnX
The name by which the aggregated value is returned. If omitted, the name is equal to the corresponding expression (expressionX). If multiple columns have the same name, all but the rightmost column with this name will be ignored.
expressionX
An expression that computes an aggregated value.
grouping_expressionX
An expression that outputs the values to group by. If its name coincides with one of the computed columns, that column will be ignored.
boolean_expressionX

The condition that must be met for a row to be included in the evaluation of expressionX. Has no effect on grouping_expressionX or other aggregation expressions. Consequently, the following are not equivalent:

... | STATS ... WHERE <condition> ...
		
... | WHERE <condition> | STATS ...
		
Note

Individual null values are skipped when computing aggregations.

The STATS processing command groups rows according to a common value and calculates one or more aggregated values over the grouped rows. For the calculation of each aggregated value, the rows in a group can be filtered with WHERE. If BY is omitted, the output table contains exactly one row with the aggregations applied over the entire dataset.

The following aggregation functions are supported:

  • ABSENT

    Returns true if the input expression yields no non-null values.

  • AVG

    Returns the average of a numeric field.

  • COUNT

    Returns the total number of input values.

  • COUNT_DISTINCT

    Returns the approximate number of distinct values.

  • EARLIEST

    Returns the earliest value of a field sorted by timestamp.

  • FIRST

    Returns the earliest occurrence of a field based on a sort field.

  • LAST

    Returns the latest occurrence of a field based on a sort field.

  • LATEST

    Returns the latest value of a field sorted by timestamp.

  • MAX

    Returns the maximum value of a field.

  • MEDIAN

    Returns the median value of a numeric field.

  • MEDIAN_ABSOLUTE_DEVIATION

    Returns the median absolute deviation, a robust measure of variability.

  • MIN

    Returns the minimum value of a field.

  • PERCENTILE

    Returns the value at which a certain percentage of observed values occur.

  • PRESENT

    Returns true if the input expression yields any non-null values.

  • SAMPLE

    Collects sample values for a field.

  • SPARKLINE

    Computes y-axis values of a sparkline graph for an aggregation over time.

  • ST_CENTROID_AGG

    Calculates the spatial centroid over a field with geometry type.

  • ST_EXTENT_AGG

    Calculates the spatial extent bounding box over a field with geometry type.

  • STD_DEV

    Returns the population standard deviation of a numeric field.

  • SUM

    Returns the sum of a numeric expression.

  • TOP

    Collects the top values for a field, including repeated values.

  • VALUES

    Returns unique deduplicated values as a multivalued field.

  • VARIANCE

    Returns the population variance of a numeric field.

  • WEIGHTED_AVG

    Returns the weighted average of a numeric expression.

When STATS is used under the TS source command, time series aggregation functions are also supported, with different semantics when BY is omitted or combined with WITHOUT ( ). For more details, check out the TS source command.

The following grouping functions are supported:

  • BUCKET

    Creates groups of values (buckets) from a datetime or numeric input.

  • CATEGORIZE

    Groups text messages into categories of similarly formatted text values.

  • TBUCKET

    Creates timestamp-based buckets aligned to calendar boundaries.

  • WITHOUT
Note

STATS without any groups is much much faster than adding a group.

Note

Grouping on a single expression is currently much more optimized than grouping on many expressions. In some tests we have seen grouping on a single keyword column to be five times faster than grouping on two keyword columns. Do not try to work around this by combining the two columns together with something like CONCAT and then grouping - that is not going to be faster.

The following examples demonstrate common STATS patterns.

Combine an aggregation with BY to compute a value for each group:

FROM employees
| STATS count = COUNT(emp_no) BY languages
| SORT languages
		
count:long languages:integer
15 1
19 2
17 3
18 4
21 5
10 null

Omitting BY returns one row with the aggregations applied over the entire dataset:

FROM employees
| STATS avg_lang = AVG(languages)
		
avg_lang:double
3.1222222222222222

Separate multiple aggregations with commas to compute them in a single pass:

FROM employees
| STATS avg_lang = AVG(languages), max_lang = MAX(languages)
		
avg_lang:double max_lang:integer
3.1222222222222222 5

Use per-aggregation WHERE to compute conditional metrics from the same dataset in a single pass:

FROM employees
| STATS avg50s = AVG(salary)::LONG WHERE birth_date < "1960-01-01",
        avg60s = AVG(salary)::LONG WHERE birth_date >= "1960-01-01"
        BY gender
| SORT gender
		
avg50s:long avg60s:long gender:keyword
55462 46637 F
48279 44879 M

Filtered and unfiltered aggregations can be freely mixed. Grouping is also optional:

FROM employees
| EVAL Ks = salary / 1000
| STATS under_40K = COUNT(*) WHERE Ks < 40,
        inbetween = COUNT(*) WHERE 40 <= Ks AND Ks < 60,
        over_60K  = COUNT(*) WHERE 60 <= Ks,
        total     = COUNT(*)
		
  1. thousands
under_40K:long inbetween:long over_60K:long total:long
36 39 25 100

The WHERE clause can also filter on the grouping key. The group itself will still appear in the output, but with a default value for the aggregation:

FROM employees
| STATS `count` = COUNT(*) WHERE languages < 3 BY languages
		
count:long languages:integer
15 1
19 2
0 3
0 4
0 5
0 null

Compare this to filtering with WHERE before STATS, where rows are excluded before grouping, so non-matching groups don't appear in the output at all:

FROM employees
| WHERE languages < 3
| STATS `count` = COUNT(*) BY languages
		
count:long languages:integer
15 1
19 2

Separate multiple grouping expressions with a comma:

FROM employees
| EVAL hired = DATE_FORMAT("yyyy", hire_date)
| STATS avg_salary = AVG(salary) BY hired, languages.long
| EVAL avg_salary = ROUND(avg_salary)
| SORT hired, languages.long
		

If the grouping key is multivalued then the input row is in all groups:

ROW price = 10, color = ["blue", "pink", "yellow"]
| STATS SUM(price) BY color
		
SUM(price):long color:keyword
10 blue
10 pink
10 yellow

If all the grouping keys are multivalued then the input row is in all groups:

ROW price = 10, color = ["blue", "pink", "yellow"], size = ["s", "m", "l"]
| STATS SUM(price) BY color, size
		
SUM(price):long color:keyword size:keyword
10 blue l
10 blue m
10 blue s
10 pink l
10 pink m
10 pink s
10 yellow l
10 yellow m
10 yellow s

The input ROW is in all groups. The entire row. All the values. Even group keys. That means that:

ROW color = ["blue", "pink", "yellow"]
| STATS VALUES(color) BY color
		
VALUES(color):keyword color:keyword
[blue, pink, yellow] blue
[blue, pink, yellow] pink
[blue, pink, yellow] yellow

The VALUES function above sees the whole row - all of the values of the group key. If you want to send the group key to the function then MV_EXPAND first:

ROW color = ["blue", "pink", "yellow"]
| MV_EXPAND color
| STATS VALUES(color) BY color
		
VALUES(color):keyword color:keyword
blue blue
pink pink
yellow yellow

Refer to elasticsearch/issues/134792 for an even more in depth explanation.

Both aggregation and grouping expressions accept other functions, which is useful for using STATS on multivalue columns. For example, to calculate the average salary change, use MV_AVG to first average the multiple values per employee, then pass the result to AVG:

FROM employees
| STATS avg_salary_change = ROUND(AVG(MV_AVG(salary_change)), 10)
		
avg_salary_change:double
1.3904535865

Grouping expressions aren't limited to column references: any expression works. For example, group by a derived value using LEFT:

FROM employees
| STATS my_count = COUNT() BY LEFT(last_name, 1)
| SORT `LEFT(last_name, 1)`
		
my_count:long LEFT(last_name, 1):keyword
2 A
11 B
5 C
5 D
2 E
4 F
4 G
6 H
2 J
3 K
5 L
12 M
4 N
1 O
7 P
5 R
13 S
4 T
2 W
3 Z

Specifying the output column name is optional. If not specified, the new column name is equal to the expression. The following query returns a column named AVG(salary):

FROM employees
| STATS AVG(salary)
		
AVG(salary):double
48248.55

Because this name contains special characters, it needs to be quoted with backticks (```) when using it in subsequent commands:

FROM employees
| STATS AVG(salary)
| EVAL avg_salary_rounded = ROUND(`AVG(salary)`)
		
AVG(salary):double avg_salary_rounded:double
48248.55 48249.0