BigQuery: Computing an aggregate over a window of time for each person

Time: 2015-09-29 09:37:35

Tags: sql aggregate-functions google-bigquery window-functions

Given a table in Google BigQuery:

User  Timestamp
A     TIMESTAMP(12/05/2015 12:05:01.8023)
B     TIMESTAMP(9/29/2015 12:15:01.0323)
B     TIMESTAMP(9/29/2015 13:05:01.0233)
A     TIMESTAMP(9/29/2015 14:05:01.0432)
C     TIMESTAMP(8/15/2015 5:05:01.0000)
B     TIMESTAMP(9/29/2015 14:06:01.0233)
A     TIMESTAMP(9/29/2015 14:06:01.0432)

Is there a simple way to compute, for each user, the maximum number of events that user had within one hour, where the one-hour time window is a parameter?

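I tried to solve this myself by building on the LAG and partition functions used in these two questions:

BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)

Bigquery SQL for sliding window aggregate

But I found those posts too dissimilar: I am not looking for the number of people per time window, but for the maximum number of events per person within a time window.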

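3 answers:

Answer 0 (score: 7)

Here is an efficient, succinct way to do it that exploits the ordered structure of the timestamps. The RANGE frame below is expressed in microseconds, so 60 * 60 * 1000000 is one hour; change that value to use a different window size.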

SELECT
  user,
  MAX(per_hour) AS max_event_per_hour
FROM
(
  SELECT 
    user,
    COUNT(*) OVER (PARTITION BY user ORDER BY timestamp RANGE BETWEEN 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) as per_hour,
    timestamp
  FROM 
    [dataset_example_in_question_user_timestamps]
)
GROUP BY user
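
To make it easier to see what the window frame above is counting, here is a minimal Python sketch of the same idea: for each event, count that user's events in the preceding hour (boundaries inclusive), then take the maximum per user. This is only a reference sketch for checking results on small data, not part of the BigQuery answer; the function name max_events_per_window and the window parameter are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def max_events_per_window(events, window=timedelta(hours=1)):
    """Return {user: max number of events in any interval [t - window, t]}.

    Hypothetical helper: mirrors a per-user trailing window ending at
    each event, with the window size passed in as a parameter.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    result = {}
    for user, stamps in by_user.items():
        stamps.sort()
        best = 0
        start = 0  # index of the oldest event still inside the window
        for end, ts in enumerate(stamps):
            # Drop events that are more than `window` older than this one.
            while ts - stamps[start] > window:
                start += 1
            best = max(best, end - start + 1)
        result[user] = best
    return result


# The sample rows from the question, written as Python datetimes.
events = [
    ("A", datetime(2015, 12, 5, 12, 5, 1, 802300)),
    ("B", datetime(2015, 9, 29, 12, 15, 1, 32300)),
    ("B", datetime(2015, 9, 29, 13, 5, 1, 23300)),
    ("A", datetime(2015, 9, 29, 14, 5, 1, 43200)),
    ("C", datetime(2015, 8, 15, 5, 5, 1, 0)),
    ("B", datetime(2015, 9, 29, 14, 6, 1, 23300)),
    ("A", datetime(2015, 9, 29, 14, 6, 1, 43200)),
]

print(max_events_per_window(events))
```

On the sample rows this gives 2 for A, 2 for B, and 1 for C. Passing a different timedelta as window changes the window size, which corresponds to changing the microsecond constant in the RANGE clause.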

Answer 1 (score: 2)

Try the GBQ query below. I haven't tested it much, but it looks workable to me.

SELECT
  User, Max(events) as Max_Events
FROM (
  SELECT 
    b.User as User, 
    b.Timestamp as Timestamp,
    COUNT(1) as Events
  FROM [your_dataset.your_table] as b
  JOIN (
    SELECT User, Timestamp 
    FROM [your_dataset.your_table]
    ) as w 
  ON w.User = b.User
  WHERE ROUND((TIMESTAMP_TO_SEC(TIMESTAMP(w.Timestamp)) - 
               TIMESTAMP_TO_SEC(TIMESTAMP(b.Timestamp))) / 3600, 1) BETWEEN 0 AND 1
  GROUP BY 1, 2
)
GROUP BY 1

Answer 2 (score: 1)

I think you can use a query like this (in T-SQL):

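-- Count events per user per calendar hour, then keep the largest count.
-- Note: these are fixed clock-hour buckets, not a sliding one-hour window.
SELECT "User", MAX(s) As Maximum_Number_of_Events_this_User_Had_in_One_Hour
FROM (
    SELECT "User", COUNT(*) As s
    FROM yourTable
    GROUP BY "User", CAST("Timestamp" As date), DATEPART(Hour, "Timestamp")) As t
GROUP BY "User"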
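
One thing worth checking with this approach is that grouping by date and hour uses fixed calendar-hour buckets, which can differ from a sliding one-hour window. The Python sketch below illustrates the difference using user B's rows from the question; the function name max_events_per_calendar_hour is a hypothetical name used only for this illustration.

```python
from collections import Counter
from datetime import datetime


def max_events_per_calendar_hour(events):
    """Return {user: largest event count in any single clock hour}."""
    # Truncate each timestamp to the start of its clock hour and count.
    counts = Counter(
        (user, ts.replace(minute=0, second=0, microsecond=0))
        for user, ts in events
    )
    best = {}
    for (user, _hour), n in counts.items():
        best[user] = max(best.get(user, 0), n)
    return best


# User B's rows from the question.
events_b = [
    ("B", datetime(2015, 9, 29, 12, 15, 1, 32300)),
    ("B", datetime(2015, 9, 29, 13, 5, 1, 23300)),
    ("B", datetime(2015, 9, 29, 14, 6, 1, 23300)),
]

print(max_events_per_calendar_hour(events_b))
```

Here B's events at 12:15 and 13:05 are about 50 minutes apart, so a sliding one-hour window would count them together, but they fall into different clock hours and the bucketed result for B is 1.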