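How to calculate the median over multiple columns in Google BigQuery?

Time: 2019-04-23 03:42:33

Tags: sql google-bigquery

I am building a query that calculates, for each day, the median number of visits for two different websites.

The output should look like this: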

+------------+---------+---------------+
|    date    | website | median_visits |
+------------+---------+---------------+
| 2019-04-01 | A       | median_value  |
| 2019-04-01 | B       | median_value  |
| 2019-04-02 | A       | median_value  |
| 2019-04-02 | B       | median_value  |
| 2019-04-03 | A       | median_value  |
| 2019-04-03 | B       | median_value  |
+------------+---------+---------------+

This is what my table (which has 20,000 rows) looks like:

+------------+---------+--------+
|    date    | website | visits |
+------------+---------+--------+
| 2019-04-01 | A       |   10.0 |
| 2019-04-01 | B       |   14.0 |
| 2019-04-02 | A       |   85.0 |
| 2019-04-03 | A       |   75.0 |
| 2019-04-02 | B       |    3.0 |
| 2019-04-02 | B       |   45.0 |
| 2019-04-01 | A       |   12.0 |
| 2019-04-03 | A       |   44.0 |
| 2019-04-01 | A       |   99.0 |
+------------+---------+--------+

What is the most efficient way to query the desired output? I am currently using:

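SELECT DISTINCT date, website, median_visits
FROM (
  -- PERCENTILE_CONT is an analytic function, so it returns one value per input row;
  -- the outer DISTINCT collapses those duplicates to one row per (date, website).
  SELECT date, website,
    PERCENTILE_CONT(visits, 0.5) OVER (PARTITION BY date, website) AS median_visits
  FROM `project.dataset.table`  -- placeholder table name
)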

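1 Answer:

Answer 0: (score: 1)

The query below is for BigQuery Standard SQL. I can't say it is the best approach, and I can't even guarantee it will be better, but in my tests I saw a better execution plan and better slot usage. So try it on your own data and compare.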

#standardSQL
SELECT date, website, 
  (SELECT PERCENTILE_CONT(visit, 0.5) OVER() 
    FROM UNNEST(visits) visit LIMIT 1
  ) AS median_visits
FROM (
  SELECT date, website, ARRAY_AGG(visits) visits
  FROM `project.dataset.table`
  GROUP BY date, website
)
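
To sanity-check the output of either query, the medians for the nine sample rows in the question can be computed locally. The following is a minimal Python sketch, not part of the original answer; the function name `median_by_group` is hypothetical. It relies on `statistics.median`, which averages the two middle values for an even-sized group, and that matches what `PERCENTILE_CONT(x, 0.5)` returns through linear interpolation.

```python
from collections import defaultdict
from statistics import median

# Sample rows copied from the question's table: (date, website, visits).
rows = [
    ("2019-04-01", "A", 10.0),
    ("2019-04-01", "B", 14.0),
    ("2019-04-02", "A", 85.0),
    ("2019-04-03", "A", 75.0),
    ("2019-04-02", "B", 3.0),
    ("2019-04-02", "B", 45.0),
    ("2019-04-01", "A", 12.0),
    ("2019-04-03", "A", 44.0),
    ("2019-04-01", "A", 99.0),
]


def median_by_group(rows):
    """Return {(date, website): median of visits}, ordered by key.

    Hypothetical helper: it mirrors GROUP BY date, website followed by
    a median over each group's visits.
    """
    groups = defaultdict(list)
    for date, website, visits in rows:
        groups[(date, website)].append(visits)
    return {key: median(values) for key, values in sorted(groups.items())}


expected = median_by_group(rows)
for (date, website), value in expected.items():
    print(date, website, value)
```

The sample contains no rows for website B on 2019-04-03, so this yields five groups rather than the six shown in the desired output. For these rows the medians are 12.0, 14.0, 85.0, 24.0 and 59.5, which is what both SQL queries should return for the same input.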