ES|QL STATS command

The STATS processing command groups rows according to a common value and calculates one or more aggregated values over the grouped rows.

Note

For fast approximate results with confidence intervals, see Approximate STATS queries.

Syntax

		STATS [column1 =] expression1 [WHERE boolean_expression1][,
      ...,
      [columnN =] expressionN [WHERE boolean_expressionN]]
      [BY grouping_expression1[, ..., grouping_expressionN]]
		
	

The name by which the aggregated value is returned. If omitted, the name is equal to the corresponding expression (expressionX). If multiple columns have the same name, all but the rightmost column with this name will be ignored.

expressionX

An expression that computes an aggregated value.

grouping_expressionX

An expression that outputs the values to group by. If its name coincides with one of the computed columns, that column will be ignored.

boolean_expressionX

The condition that must be met for a row to be included in the evaluation of expressionX. Has no effect on grouping_expressionX or other aggregation expressions. Consequently, the following are not equivalent:

... | STATS ... WHERE <condition> ...

... | WHERE <condition> | STATS ...

Note

Individual null values are skipped when computing aggregations.

Description

The STATS processing command groups rows according to a common value and calculates one or more aggregated values over the grouped rows. For the calculation of each aggregated value, the rows in a group can be filtered with WHERE. If BY is omitted, the output table contains exactly one row with the aggregations applied over the entire dataset.

The following aggregation functions are supported:

ABSENT
Returns true if the input expression yields no non-null values.
AVG
Returns the average of a numeric field.
COUNT
Returns the total number of input values.
COUNT_DISTINCT
Returns the approximate number of distinct values.
EARLIEST
Returns the earliest value of a field sorted by timestamp.
FIRST
Returns the earliest occurrence of a field based on a sort field.
LAST
Returns the latest occurrence of a field based on a sort field.
LATEST
Returns the latest value of a field sorted by timestamp.
MAX
Returns the maximum value of a field.
MEDIAN
Returns the median value of a numeric field.
MEDIAN_ABSOLUTE_DEVIATION
Returns the median absolute deviation, a robust measure of variability.
MIN
Returns the minimum value of a field.
PERCENTILE
Returns the value at which a certain percentage of observed values occur.
PRESENT
Returns true if the input expression yields any non-null values.
SAMPLE
Collects sample values for a field.
SPARKLINE
Computes y-axis values of a sparkline graph for an aggregation over time.
ST_CENTROID_AGG
Calculates the spatial centroid over a field with geometry type.
ST_EXTENT_AGG
Calculates the spatial extent bounding box over a field with geometry type.
STD_DEV
Returns the population standard deviation of a numeric field.
SUM
Returns the sum of a numeric expression.
TOP
Collects the top values for a field, including repeated values.
VALUES
Returns unique deduplicated values as a multivalued field.
VARIANCE
Returns the population variance of a numeric field.
WEIGHTED_AVG
Returns the weighted average of a numeric expression.

When STATS is used under the TS source command, time series aggregation functions are also supported, with different semantics when BY is omitted or combined with WITHOUT ( ). For more details, check out the TS source command.

The following grouping functions are supported:

BUCKET
Creates groups of values (buckets) from a datetime or numeric input.
CATEGORIZE
Groups text messages into categories of similarly formatted text values.
TBUCKET
Creates timestamp-based buckets aligned to calendar boundaries.
WITHOUT

Note

STATS without any groups is much much faster than adding a group.

Note

Grouping on a single expression is currently much more optimized than grouping on many expressions. In some tests we have seen grouping on a single keyword column to be five times faster than grouping on two keyword columns. Do not try to work around this by combining the two columns together with something like CONCAT and then grouping - that is not going to be faster.

Examples

The following examples demonstrate common STATS patterns.

Group by column

Combine an aggregation with BY to compute a value for each group:

		FROM employees
| STATS count = COUNT(emp_no) BY languages
| SORT languages
		
	

count:long	languages:integer
15	1
19	2
17	3
18	4
21	5
10	null

Aggregate without grouping

Omitting BY returns one row with the aggregations applied over the entire dataset:

		FROM employees
| STATS avg_lang = AVG(languages)

avg_lang:double
3.1222222222222222

Calculate multiple values

Separate multiple aggregations with commas to compute them in a single pass:

		FROM employees
| STATS avg_lang = AVG(languages), max_lang = MAX(languages)

avg_lang:double	max_lang:integer
3.1222222222222222	5

Filter aggregations with WHERE

Use per-aggregation WHERE to compute conditional metrics from the same dataset in a single pass:

		FROM employees
| STATS avg50s = AVG(salary)::LONG WHERE birth_date < "1960-01-01",
        avg60s = AVG(salary)::LONG WHERE birth_date >= "1960-01-01"
        BY gender
| SORT gender
		
	

avg50s:long	avg60s:long	gender:keyword
55462	46637	F
48279	44879	M

Mix filtered and unfiltered aggregations

Filtered and unfiltered aggregations can be freely mixed. Grouping is also optional:

		FROM employees
| EVAL Ks = salary / 1000
| STATS under_40K = COUNT(*) WHERE Ks < 40,
        inbetween = COUNT(*) WHERE 40 <= Ks AND Ks < 60,
        over_60K  = COUNT(*) WHERE 60 <= Ks,
        total     = COUNT(*)
		
	

thousands

under_40K:long	inbetween:long	over_60K:long	total:long
36	39	25	100

Filter on the grouping key

The WHERE clause can also filter on the grouping key. The group itself will still appear in the output, but with a default value for the aggregation:

		FROM employees
| STATS `count` = COUNT(*) WHERE languages < 3 BY languages

count:long	languages:integer
15	1
19	2
0	3
0	4
0	5
0	null

Compare this to filtering with WHERE before STATS, where rows are excluded before grouping, so non-matching groups don't appear in the output at all:

		FROM employees
| WHERE languages < 3
| STATS `count` = COUNT(*) BY languages
		
	

count:long	languages:integer
15	1
19	2

Group by multiple values

Separate multiple grouping expressions with a comma:

		FROM employees
| EVAL hired = DATE_FORMAT("yyyy", hire_date)
| STATS avg_salary = AVG(salary) BY hired, languages.long
| EVAL avg_salary = ROUND(avg_salary)
| SORT hired, languages.long
		
	

Multivalued inputs

If the grouping key is multivalued then the input row is in all groups:

		ROW price = 10, color = ["blue", "pink", "yellow"]
| STATS SUM(price) BY color

SUM(price):long	color:keyword
10	blue
10	pink
10	yellow

If all the grouping keys are multivalued then the input row is in all groups:

		ROW price = 10, color = ["blue", "pink", "yellow"], size = ["s", "m", "l"]
| STATS SUM(price) BY color, size

SUM(price):long	color:keyword	size:keyword
10	blue	l
10	blue	m
10	blue	s
10	pink	l
10	pink	m
10	pink	s
10	yellow	l
10	yellow	m
10	yellow	s

The input ROW is in all groups. The entire row. All the values. Even group keys. That means that:

		ROW color = ["blue", "pink", "yellow"]
| STATS VALUES(color) BY color

VALUES(color):keyword	color:keyword
[blue, pink, yellow]	blue
[blue, pink, yellow]	pink
[blue, pink, yellow]	yellow

The VALUES function above sees the whole row - all of the values of the group key. If you want to send the group key to the function then MV_EXPAND first:

		ROW color = ["blue", "pink", "yellow"]
| MV_EXPAND color
| STATS VALUES(color) BY color
		
	

VALUES(color):keyword	color:keyword
blue	blue
pink	pink
yellow	yellow

Refer to elasticsearch/issues/134792 for an even more in depth explanation.

Multivalue functions

Both aggregation and grouping expressions accept other functions, which is useful for using STATS on multivalue columns. For example, to calculate the average salary change, use MV_AVG to first average the multiple values per employee, then pass the result to AVG:

		FROM employees
| STATS avg_salary_change = ROUND(AVG(MV_AVG(salary_change)), 10)

avg_salary_change:double
1.3904535865

Grouping expressions aren't limited to column references: any expression works. For example, group by a derived value using LEFT:

		FROM employees
| STATS my_count = COUNT() BY LEFT(last_name, 1)
| SORT `LEFT(last_name, 1)`
		
	

my_count:long	LEFT(last_name, 1):keyword
2	A
11	B
5	C
5	D
2	E
4	F
4	G
6	H
2	J
3	K
5	L
12	M
4	N
1	O
7	P
5	R
13	S
4	T
2	W
3	Z

Output column naming

Specifying the output column name is optional. If not specified, the new column name is equal to the expression. The following query returns a column named AVG(salary):

		FROM employees
| STATS AVG(salary)

AVG(salary):double
48248.55

Because this name contains special characters, it needs to be quoted with backticks (```) when using it in subsequent commands:

		FROM employees
| STATS AVG(salary)
| EVAL avg_salary_rounded = ROUND(`AVG(salary)`)
		
	

AVG(salary):double	avg_salary_rounded:double
48248.55	48249.0