BigQuery: How to calculate the running count of distinct visitors per day and category?

Time: 2013-11-28 16:16:44

Tags: google-bigquery

In Google BigQuery, I have a table like this:

  

startTime: STRING, visitorId: STRING, category: STRING

Sample content:

startTime            visitorId   category
-------------------  ---------   --------
2013-11-27 00:00:00     A           X         
2013-11-27 05:00:00     A           X 
2013-11-27 07:00:00     B           X 
2013-11-28 08:00:00     C           X 

I would like to get the following result:

day         category  runningCountOfDistinctVisitors  
---------   --------  ------------------------------   
2013-11-27     X                   2
2013-11-28     X                   3

I have tried the following query, but it does not seem to work (it ran for more than 3 hours on a 1.2M-row table without finishing):

SELECT left(a.startTime,10) as day, 
a.category,
count(distinct a.visitorId) as runningCountOfDistinctVisitors
FROM [MyDataset.MyTable] a 
LEFT JOIN EACH [MyDataset.MyTable] b ON a.category = b.category 
WHERE left(b.startTime,10) < left(a.startTime,10)
GROUP EACH BY a.category, day
ORDER BY a.category, day

I also tried using the partition (window) functions, but count distinct does not seem to be supported there.

5 answers:

Answer 0 (score: 3)

Try this:

ts:timestamp,visitor:string,category:string

ts                       visitor  category
-----------------------  -------  --------
2013-11-27 00:00:00 UTC  A        X  
2013-11-27 00:00:00 UTC  A        X  
2013-11-27 00:00:00 UTC  B        X  
2013-11-28 00:00:00 UTC  C        X  
2013-11-27 00:00:00 UTC  A        Y  
2013-11-28 00:00:00 UTC  B        Y  
2013-11-29 00:00:00 UTC  C        Y

Query:

select 
  day, category, sum(cd) 
over
  (partition by category order by day) as running_total
from (select date(ts) as day, category, count(distinct visitor) as cd from
  [test.runningtotal] group by day, category)

This produces:

day         category  running_total
----------  --------  -------------
2013-11-27  X         2  
2013-11-28  X         3  
2013-11-27  Y         1  
2013-11-28  Y         2  
2013-11-29  Y         3

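I have not tested it on a large dataset, but it is probably faster than the JOIN solution.

Answer 1 (score: 1)

COUNT DISTINCT is a compute-intensive operation (which is why BigQuery switches to approximate counts above 1000 distinct values unless explicitly asked not to). Doing what is almost a CROSS JOIN is an intensive operation too. Combine the two on a big dataset and you may have hit a problem that is computationally hard to solve.

Suggestions (since I don't have access to your data):

  • Instead of COUNT DISTINCT, run a subquery with GROUP EACH BY, then do a plain COUNT in the outer query. Same results, possibly with a better distribution of the computation.
  • Why LEFT JOIN EACH rather than just JOIN EACH?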
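A minimal sketch of the first suggestion, written in standard SQL rather than the legacy dialect used in the question (that choice is an assumption on my part; in legacy SQL the inner GROUP BY would be GROUP EACH BY). The inner query collapses the table to one row per day, category and visitor, so the outer query only needs a plain COUNT(*). Note that this gives the distinct visitors of each day on its own, not yet the running count:

```sql
-- Assumes the table from the question: startTime, visitorId, category (all STRING).
SELECT
  day,
  category,
  COUNT(*) AS distinct_visitors          -- one row per visitor, so COUNT(*) is a distinct count
FROM (
  -- Deduplicate: one row per (day, category, visitor)
  SELECT
    SUBSTR(startTime, 1, 10) AS day,     -- 'YYYY-MM-DD' prefix of the timestamp string
    category,
    visitorId
  FROM `MyDataset.MyTable`
  GROUP BY day, category, visitorId
)
GROUP BY day, category
ORDER BY category, day
```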

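Update: I like Radek's answer, where he uses COUNT() OVER() instead of a JOIN: https://stackoverflow.com/a/20346427/132438

Answer 2 (score: 1)

I realize I'm very late to this, but it helped me with something I was working on, so I thought I'd add a bit more.

In BigQuery, you can run a rolling count distinct over a dataset that expands with the window.

For this example, it would look like this.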

SELECT day, category, MAX(runningCountofDistinctVisitors) running_ct
FROM
(SELECT left(a.startTime,10) as day, 
a.category category,
count(distinct a.visitorId)
  OVER(PARTITION BY category
  ORDER BY LEFT(a.startTime,10)) as runningCountOfDistinctVisitors
FROM [MyDataset.MyTable] a 
LEFT JOIN EACH [MyDataset.MyTable] b ON a.category = b.category 
WHERE left(b.startTime,10) < left(a.startTime,10)
GROUP EACH BY a.category, day, a.visitorId)
GROUP EACH BY day, category
ORDER BY day, category

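This solves the problem of the count coming out higher than expected, because the window expands to include the current day and all previous days, rather than adding up the counts of the previous days.

I believe there is also a way to do this without the outer query that takes the maximum for each day, but I have not been able to work it out.

Answer 3 (score: 0)

Another approach is to use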

row_number() over (partition by visitor,category order by cast(ts as date)) as row_n 

and then

sum(if(row_n=1,1,0)) over (partition by category order by cast(ts as date))
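Put together, the two fragments could look like the sketch below. It is written in standard SQL against the table from the question (startTime, visitorId, category) instead of the ts and visitor columns used in the fragments, and the CTE names numbered and daily_new are made up for illustration. The idea is to flag the first row of each visitor within a category, count those "first appearances" per day, and then take a running sum of them:

```sql
WITH numbered AS (
  SELECT
    SUBSTR(startTime, 1, 10) AS day,     -- 'YYYY-MM-DD'; sorts chronologically as a string
    category,
    -- row_n = 1 marks the first time this visitor shows up in this category
    ROW_NUMBER() OVER (PARTITION BY visitorId, category ORDER BY startTime) AS row_n
  FROM `MyDataset.MyTable`
),
daily_new AS (
  -- Number of visitors seen for the first time on each day, per category
  SELECT day, category, COUNTIF(row_n = 1) AS new_visitors
  FROM numbered
  GROUP BY day, category
)
SELECT
  day,
  category,
  -- Running sum of first appearances = running count of distinct visitors
  SUM(new_visitors) OVER (PARTITION BY category ORDER BY day) AS runningCountOfDistinctVisitors
FROM daily_new
ORDER BY category, day
```

On the four sample rows from the question this should give 2 for 2013-11-27 and 3 for 2013-11-28. A visitor who comes back on a later day gets row_n > 1 there and is not counted again. Days without any visit do not appear in the output.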

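Answer 4 (score: 0)

Some of the previous answers do not really compute a count distinct across days. They just count distinct visitors per day and then sum those daily counts.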
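To see the difference, here is a small self-contained sketch with made-up inline data (the visits CTE is purely illustrative): visitor A comes on both days, so summing the daily distinct counts counts A twice.

```sql
WITH visits AS (
  SELECT DATE '2013-11-27' AS day, 'A' AS visitor UNION ALL
  SELECT DATE '2013-11-28', 'A' UNION ALL   -- A returns on the second day
  SELECT DATE '2013-11-28', 'B'
),
daily AS (
  SELECT day, COUNT(DISTINCT visitor) AS cd
  FROM visits
  GROUP BY day
)
SELECT
  day,
  SUM(cd) OVER (ORDER BY day) AS summed_daily_counts
FROM daily
ORDER BY day
-- 2013-11-27 -> 1
-- 2013-11-28 -> 3, although only 2 distinct visitors (A and B) exist
```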

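The following set of queries will let you compute a running count distinct on very large datasets. I have tested it with billions of unique IDs.

It compresses the large number of unique visitor IDs into small sketches, one per day and category, which can be merged with other sketches to get the unique count quickly.

It uses TEMP TABLEs because they are much faster than CTEs or subqueries.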

-- create one sketch per <day, category> pair
CREATE TEMP TABLE sketches AS
(SELECT
  category
, DATE(ts) day
, HLL_COUNT.INIT(visitor_id, 24) visitor_sketch
FROM `MyDataset.MyTable`  -- standard SQL table reference; HLL_COUNT is not available in legacy SQL
WHERE DATE(ts) >= DATE_SUB(CURRENT_DATE, INTERVAL 5 DAY)
GROUP BY category, day
);

-- get a list of the days involved in the study
CREATE TEMP TABLE window_starts AS
(
  SELECT DISTINCT(day) day
  FROM sketches
);

/* For each category and day in the study, merge all the 
   corresponding sketches for that category and from the
   beginning of the study up to that day.

   N.B. "MERGE" performs an approximate count distinct
   see https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions#hll_countmerge
*/
SELECT
  category
, window_starts.day
, HLL_COUNT.MERGE(visitor_sketch) visitors
FROM window_starts
CROSS JOIN sketches
WHERE sketches.day <= window_starts.day
GROUP BY category, window_starts.day
ORDER BY 1