ES|QL STATS command
The STATS processing command groups rows according to a common value
and calculates one or more aggregated values over the grouped rows.
For fast approximate results with confidence intervals, see Approximate STATS queries.
STATS [column1 =] expression1 [WHERE boolean_expression1][,
...,
[columnN =] expressionN [WHERE boolean_expressionN]]
[BY grouping_expression1[, ..., grouping_expressionN]]
columnX- The name by which the aggregated value is returned. If omitted, the name is
equal to the corresponding expression (
expressionX). If multiple columns have the same name, all but the rightmost column with this name will be ignored. expressionX- An expression that computes an aggregated value.
grouping_expressionX- An expression that outputs the values to group by. If its name coincides with one of the computed columns, that column will be ignored.
boolean_expressionX-
The condition that must be met for a row to be included in the evaluation of
expressionX. Has no effect ongrouping_expressionXor other aggregation expressions. Consequently, the following are not equivalent:... | STATS ... WHERE <condition> ...... | WHERE <condition> | STATS ...
Individual null values are skipped when computing aggregations.
The STATS processing command groups rows according to a common value
and calculates one or more aggregated values over the grouped rows. For the
calculation of each aggregated value, the rows in a group can be filtered with
WHERE. If BY is omitted, the output table contains exactly one row with
the aggregations applied over the entire dataset.
The following aggregation functions are supported:
ABSENTReturns true if the input expression yields no non-null values.
AVGReturns the average of a numeric field.
COUNTReturns the total number of input values.
COUNT_DISTINCTReturns the approximate number of distinct values.
EARLIESTReturns the earliest value of a field sorted by timestamp.
FIRSTReturns the earliest occurrence of a field based on a sort field.
LASTReturns the latest occurrence of a field based on a sort field.
LATESTReturns the latest value of a field sorted by timestamp.
MAXReturns the maximum value of a field.
MEDIANReturns the median value of a numeric field.
MEDIAN_ABSOLUTE_DEVIATIONReturns the median absolute deviation, a robust measure of variability.
MINReturns the minimum value of a field.
PERCENTILEReturns the value at which a certain percentage of observed values occur.
PRESENTReturns true if the input expression yields any non-null values.
SAMPLECollects sample values for a field.
SPARKLINEComputes y-axis values of a sparkline graph for an aggregation over time.
ST_CENTROID_AGGCalculates the spatial centroid over a field with geometry type.
ST_EXTENT_AGGCalculates the spatial extent bounding box over a field with geometry type.
STD_DEVReturns the population standard deviation of a numeric field.
SUMReturns the sum of a numeric expression.
TOPCollects the top values for a field, including repeated values.
VALUESReturns unique deduplicated values as a multivalued field.
VARIANCEReturns the population variance of a numeric field.
WEIGHTED_AVGReturns the weighted average of a numeric expression.
When STATS is used under the TS source command,
time series aggregation functions
are also supported, with different semantics when BY is omitted or combined with
WITHOUT
(
TS source command.
The following grouping functions are supported:
BUCKETCreates groups of values (buckets) from a datetime or numeric input.
CATEGORIZEGroups text messages into categories of similarly formatted text values.
TBUCKETCreates timestamp-based buckets aligned to calendar boundaries.
WITHOUT
STATS without any groups is much much faster than adding a group.
Grouping on a single expression is currently much more optimized than grouping
on many expressions. In some tests we have seen grouping on a single keyword
column to be five times faster than grouping on two keyword columns. Do
not try to work around this by combining the two columns together with
something like CONCAT
and then grouping - that is not going to be faster.
The following examples demonstrate common STATS patterns.
Combine an aggregation with BY to compute a value for each group:
FROM employees
| STATS count = COUNT(emp_no) BY languages
| SORT languages
| count:long | languages:integer |
|---|---|
| 15 | 1 |
| 19 | 2 |
| 17 | 3 |
| 18 | 4 |
| 21 | 5 |
| 10 | null |
Omitting BY returns one row with the aggregations applied over the entire
dataset:
FROM employees
| STATS avg_lang = AVG(languages)
| avg_lang:double |
|---|
| 3.1222222222222222 |
Separate multiple aggregations with commas to compute them in a single pass:
FROM employees
| STATS avg_lang = AVG(languages), max_lang = MAX(languages)
| avg_lang:double | max_lang:integer |
|---|---|
| 3.1222222222222222 | 5 |
Use per-aggregation WHERE to compute conditional metrics from the same
dataset in a single pass:
FROM employees
| STATS avg50s = AVG(salary)::LONG WHERE birth_date < "1960-01-01",
avg60s = AVG(salary)::LONG WHERE birth_date >= "1960-01-01"
BY gender
| SORT gender
| avg50s:long | avg60s:long | gender:keyword |
|---|---|---|
| 55462 | 46637 | F |
| 48279 | 44879 | M |
Filtered and unfiltered aggregations can be freely mixed. Grouping is also optional:
FROM employees
| EVAL Ks = salary / 1000
| STATS under_40K = COUNT(*) WHERE Ks < 40,
inbetween = COUNT(*) WHERE 40 <= Ks AND Ks < 60,
over_60K = COUNT(*) WHERE 60 <= Ks,
total = COUNT(*)
- thousands
| under_40K:long | inbetween:long | over_60K:long | total:long |
|---|---|---|---|
| 36 | 39 | 25 | 100 |
The WHERE clause can also filter on the grouping key. The group itself will
still appear in the output, but with a default value for the aggregation:
FROM employees
| STATS `count` = COUNT(*) WHERE languages < 3 BY languages
| count:long | languages:integer |
|---|---|
| 15 | 1 |
| 19 | 2 |
| 0 | 3 |
| 0 | 4 |
| 0 | 5 |
| 0 | null |
Compare this to filtering with WHERE before STATS, where rows are excluded
before grouping, so non-matching groups don't appear in the output at all:
FROM employees
| WHERE languages < 3
| STATS `count` = COUNT(*) BY languages
| count:long | languages:integer |
|---|---|
| 15 | 1 |
| 19 | 2 |
Separate multiple grouping expressions with a comma:
FROM employees
| EVAL hired = DATE_FORMAT("yyyy", hire_date)
| STATS avg_salary = AVG(salary) BY hired, languages.long
| EVAL avg_salary = ROUND(avg_salary)
| SORT hired, languages.long
If the grouping key is multivalued then the input row is in all groups:
ROW price = 10, color = ["blue", "pink", "yellow"]
| STATS SUM(price) BY color
| SUM(price):long | color:keyword |
|---|---|
| 10 | blue |
| 10 | pink |
| 10 | yellow |
If all the grouping keys are multivalued then the input row is in all groups:
ROW price = 10, color = ["blue", "pink", "yellow"], size = ["s", "m", "l"]
| STATS SUM(price) BY color, size
| SUM(price):long | color:keyword | size:keyword |
|---|---|---|
| 10 | blue | l |
| 10 | blue | m |
| 10 | blue | s |
| 10 | pink | l |
| 10 | pink | m |
| 10 | pink | s |
| 10 | yellow | l |
| 10 | yellow | m |
| 10 | yellow | s |
The input ROW is in all groups. The entire row. All the values. Even group keys. That means that:
ROW color = ["blue", "pink", "yellow"]
| STATS VALUES(color) BY color
| VALUES(color):keyword | color:keyword |
|---|---|
| [blue, pink, yellow] | blue |
| [blue, pink, yellow] | pink |
| [blue, pink, yellow] | yellow |
The VALUES function above sees the whole row - all of the values of the group
key. If you want to send the group key to the function then MV_EXPAND first:
ROW color = ["blue", "pink", "yellow"]
| MV_EXPAND color
| STATS VALUES(color) BY color
| VALUES(color):keyword | color:keyword |
|---|---|
| blue | blue |
| pink | pink |
| yellow | yellow |
Refer to elasticsearch/issues/134792 for an even more in depth explanation.
Both aggregation and grouping expressions accept other functions, which is
useful for using STATS on multivalue columns. For example, to calculate the
average salary change, use MV_AVG to first average the multiple values per
employee, then pass the result to AVG:
FROM employees
| STATS avg_salary_change = ROUND(AVG(MV_AVG(salary_change)), 10)
| avg_salary_change:double |
|---|
| 1.3904535865 |
Grouping expressions aren't limited to column references: any expression
works. For example, group by a derived value using LEFT:
FROM employees
| STATS my_count = COUNT() BY LEFT(last_name, 1)
| SORT `LEFT(last_name, 1)`
| my_count:long | LEFT(last_name, 1):keyword |
|---|---|
| 2 | A |
| 11 | B |
| 5 | C |
| 5 | D |
| 2 | E |
| 4 | F |
| 4 | G |
| 6 | H |
| 2 | J |
| 3 | K |
| 5 | L |
| 12 | M |
| 4 | N |
| 1 | O |
| 7 | P |
| 5 | R |
| 13 | S |
| 4 | T |
| 2 | W |
| 3 | Z |
Specifying the output column name is optional. If not specified, the new column
name is equal to the expression. The following query returns a column named
AVG(salary):
FROM employees
| STATS AVG(salary)
| AVG(salary):double |
|---|
| 48248.55 |
Because this name contains special characters, it needs to be quoted with backticks (```) when using it in subsequent commands:
FROM employees
| STATS AVG(salary)
| EVAL avg_salary_rounded = ROUND(`AVG(salary)`)
| AVG(salary):double | avg_salary_rounded:double |
|---|---|
| 48248.55 | 48249.0 |