Building a histogram with Google BigQuery

Time: 2013-03-17 18:33:22

Tags: sql google-bigquery

How can I write a query that makes a histogram easier to render as a chart?

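For example, we have the ages of 100 million people and want to plot a histogram with buckets for ages 0-10, 11-20, 21-30, and so on. What would the query look like?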

Has anyone done this? Have you tried connecting the query results to a Google Spreadsheet to draw the histogram?

7 answers:

Answer 0: (score: 12)

You can also use the quantiles aggregate operator to get a quick look at the distribution of ages.

SELECT
  quantiles(age, 10)
FROM mytable

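Each row of this query's result corresponds to the age at that point in the sorted list of ages: the first result is the age 1/10th of the way through the sorted list, the second is the age 2/10ths of the way through, then 3/10ths, and so on.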

Answer 1: (score: 3)

The subquery idea works, as does a "CASE WHEN" followed by a GROUP BY:

SELECT SUM(field1), bucket 
FROM (
    SELECT field1, CASE WHEN age >=  0 AND age < 10 THEN 1
                        WHEN age >= 10 AND age < 20 THEN 2
                        WHEN age >= 20 AND age < 30 THEN 3
                        ...
                        ELSE -1 END as bucket
    FROM table1) 
GROUP BY bucket

Or, if the buckets are regular, you can just divide and cast to an integer:

SELECT SUM(field1), bucket 
FROM (
    SELECT field1, INTEGER(age / 10) as bucket FROM table1)
GROUP BY bucket
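
The query above uses legacy SQL's INTEGER(). A rough standard SQL equivalent is sketched below. It is not part of the original answer, and the inline table1 rows are made-up sample data. DIV is used here because in standard SQL CAST(age / 10 AS INT64) rounds to the nearest integer instead of truncating, which would shift ages such as 17 into the wrong bucket.

```sql
#standardSQL
-- Hypothetical sample rows standing in for table1; field1 is set to 1
-- so that SUM(field1) simply counts the rows in each bucket.
WITH table1 AS (
  SELECT age, 1 AS field1
  FROM UNNEST([3, 9, 10, 17, 25, 29, 30]) AS age
)
SELECT
  DIV(age, 10) AS bucket,             -- integer division: 0 for ages 0-9, 1 for 10-19, ...
  DIV(age, 10) * 10 AS bucket_start,  -- lower bound of the bucket, handy as a chart label
  SUM(field1) AS total
FROM table1
GROUP BY bucket, bucket_start
ORDER BY bucket
```

DIV expects INT64 arguments, so a floating-point age column would need FLOOR(age / 10) instead.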

Answer 2: (score: 1)

With #standardSQL and an auxiliary stats query, we can define the range the histogram should cover.

Here it is for the flight times between SFO and JFK, with 10 buckets:

WITH data AS ( 
    SELECT *, ActualElapsedTime datapoint
    FROM `fh-bigquery.flights.ontime_201903`
    WHERE FlightDate_year = "2018-01-01" 
    AND Origin = 'SFO' AND Dest = 'JFK'
)
, stats AS (
  SELECT min+step*i min, min+step*(i+1)max
  FROM (
    SELECT max-min diff, min, max, (max-min)/10 step, GENERATE_ARRAY(0, 10, 1) i
    FROM (
      SELECT MIN(datapoint) min, MAX(datapoint) max
      FROM data
    )
  ), UNNEST(i) i
)

SELECT COUNT(*) count, (min+max)/2 avg
FROM data 
JOIN stats
ON data.datapoint >= stats.min AND data.datapoint<stats.max
GROUP BY avg
ORDER BY avg

Answer 3: (score: 0)

Write a subquery like this:

(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0)

Then you can do something like this:

SELECT * FROM
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0),
(SELECT '2' AS agegroup, count(*) FROM people WHERE AGE <= 20 AND AGE > 10),
(SELECT '3' AS agegroup, count(*) FROM people WHERE AGE <= 120 AND AGE > 20)

The result will look like this:

Row agegroup count 
1   1       somenumber
2   2       somenumber
3   3       somenumber

I hope this helps. Of course, for the age group you can write a label such as '0 to 10' instead.

Answer 4: (score: 0)

You are looking for a single vector of information. I would normally query it like this:

select
  count(*) as num,
  integer( age / 10 ) as age_group
from mytable
group by age_group 

Arbitrary groups would need a large case statement. It would be simple but much longer. My example should be fine if every bucket contains N years.
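
For arbitrary groups, standard SQL's RANGE_BUCKET function may be a shorter alternative to a long case statement. The sketch below is not from the original answer. It assumes standard SQL (the query above is legacy SQL), uses made-up sample ages in place of mytable, and picks boundaries that match the 0-10, 11-20, 21-30 buckets from the question.

```sql
#standardSQL
-- Hypothetical sample ages standing in for mytable.
WITH mytable AS (
  SELECT age FROM UNNEST([5, 10, 11, 20, 21, 45, 70]) AS age
)
SELECT
  -- RANGE_BUCKET returns how many boundaries are <= age:
  -- 1 for ages 0-10, 2 for 11-20, 3 for 21-30, 4 for 31 and above
  -- (and 0 for anything below the first boundary).
  RANGE_BUCKET(age, [0, 11, 21, 31]) AS age_group,
  COUNT(*) AS num
FROM mytable
GROUP BY age_group
ORDER BY age_group
```

The boundary array must be sorted in ascending order, and the buckets do not need to be the same width.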

Answer 5: (score: 0)

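Using a cross join to fetch the min and max values (not that expensive against a single tuple), you can get a normalized list of buckets for any given number of buckets: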

select
  min(data.VAL) as min,
  max(data.VAL) as max,
  count(data.VAL) as num,
  integer((data.VAL-value.min)/(value.max-value.min)*8) as group
from [table] data
CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min, from [table]) value
GROUP BY group
ORDER BY group 

In this example we get 8 buckets (pretty self-explanatory) plus one for null VAL.
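
The same normalization can be sketched in standard SQL as follows. This is an illustration under assumptions, not the answerer's query: the data rows are made up, the names data, bounds, lo and hi are hypothetical, and the LEAST clamp is an addition that folds the maximum value into the last bucket rather than giving it a bucket of its own.

```sql
#standardSQL
-- Hypothetical sample values standing in for the VAL column.
WITH data AS (
  SELECT val FROM UNNEST([1.0, 2.5, 4.0, 7.5, 9.0]) AS val
),
bounds AS (
  -- Single-row table holding the overall range.
  SELECT MIN(val) AS lo, MAX(val) AS hi FROM data
)
SELECT
  -- Scale val into [0, 8], take the floor, and clamp so that
  -- val = hi falls into bucket 7 instead of a separate bucket 8.
  LEAST(CAST(FLOOR((val - lo) / (hi - lo) * 8) AS INT64), 7) AS bucket,
  MIN(val) AS bucket_min,
  MAX(val) AS bucket_max,
  COUNT(val) AS num
FROM data
CROSS JOIN bounds
GROUP BY bucket
ORDER BY bucket
```

If every value is identical, hi - lo is zero and the division raises an error, so that case would need separate handling.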

Answer 6: (score: 0)

The APPROX_QUANTILES aggregate function is now available in standard SQL.

SELECT
    APPROX_QUANTILES(column, number_of_bins)
...
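
As a self-contained illustration of the call above, the sketch below computes quartile boundaries over made-up data. The people table and the boundary and boundary_index names are hypothetical. APPROX_QUANTILES returns an array with number_of_bins + 1 elements (the approximate minimum, the interior boundaries, and the approximate maximum), which is unnested here into one row per boundary.

```sql
#standardSQL
-- Hypothetical sample data: one row for each age from 0 to 99.
WITH people AS (
  SELECT age FROM UNNEST(GENERATE_ARRAY(0, 99)) AS age
)
SELECT
  boundary_index,  -- 0 is the approximate minimum, 4 the approximate maximum
  boundary
FROM (
  SELECT APPROX_QUANTILES(age, 4) AS boundaries
  FROM people
), UNNEST(boundaries) AS boundary WITH OFFSET AS boundary_index
ORDER BY boundary_index
```

Because the boundaries are approximate, they suit a quick look at the distribution better than exact bucket counts.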