Functions for creating special groupings (also known as bucketing); as such these need to be used as part of the grouping.
Synopsis:
Input:
numeric expression (typically a field) | |
numeric interval | |
date/time expression (typically a field) | |
date/time interval |
Output: non-empty buckets or groups of the given expression divided according to the given interval
Description. The histogram function takes all matching values and divides them into buckets with fixed size matching the given interval, using (roughly) the following formula:
bucket_key = Math.floor(value / interval) * interval
NULL
; as such the histogram places all missing values in the NULL
group.
Histogram
can be applied on either numeric fields:
SELECT HISTOGRAM(salary, 5000) AS h FROM emp GROUP BY h; h --------------- 25000 30000 35000 40000 45000 50000 55000 60000 65000 70000
or date/time fields:
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) AS c FROM emp GROUP BY h; h | c --------------------+--------------- null |10 1951-04-11T00:00:00Z|1 1952-04-05T00:00:00Z|10 1953-03-31T00:00:00Z|10 1954-03-26T00:00:00Z|7 1955-03-21T00:00:00Z|4 1956-03-15T00:00:00Z|4 1957-03-10T00:00:00Z|6 1958-03-05T00:00:00Z|6 1959-02-28T00:00:00Z|9 1960-02-23T00:00:00Z|7 1961-02-17T00:00:00Z|8 1962-02-12T00:00:00Z|6 1963-02-07T00:00:00Z|7 1964-02-02T00:00:00Z|5
Expressions inside the histogram are also supported as long as the return type is numeric:
SELECT HISTOGRAM(salary % 100, 10) AS h, COUNT(*) AS c FROM emp GROUP BY h; h | c ---------------+--------------- 0 |10 10 |15 20 |10 30 |14 40 |9 50 |9 60 |8 70 |13 80 |3 90 |9
Do note that histograms (and grouping functions in general) allow custom expressions but cannot have any functions applied to them in the GROUP BY
. In other words, the following statement is NOT allowed:
SELECT MONTH(HISTOGRAM(birth_date), 2)) AS h, COUNT(*) as c FROM emp GROUP BY h ORDER BY h DESC;
as it requires two groupings (one for histogram followed by a second for applying the function on top of the histogram groups).
Instead one can rewrite the query to move the expression on the histogram inside of it:
SELECT HISTOGRAM(MONTH(birth_date), 2) AS h, COUNT(*) as c FROM emp GROUP BY h ORDER BY h DESC; h | c ---------------+--------------- 12 |7 10 |17 8 |16 6 |16 4 |18 2 |10 0 |6 null |10
When the histogram in SQL is applied on DATE type instead of DATETIME, the interval specified is truncated to
the multiple of a day. E.g.: for HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '2 3:04' DAY TO MINUTE)
the interval
actually used will be INTERVAL '2' DAY
. If the interval specified is less than 1 day, e.g.:
HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '20' HOUR)
then the interval used will be INTERVAL '1' DAY
.