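I have a BigQuery table that records a row each time an item is purchased from a store. It contains an ItemID and a timestamp. I am interested in the running total of purchases for each item. I have this query, which generates the running totals: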
SELECT ItemID,timestamp,count(*)
OVER
(PARTITION BY ItemID
ORDER BY timestamp ASC, ItemID) AS runningtotal
from
(
SELECT * FROM [mydb.purchases]
)
ORDER BY timestamp
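This table contains hundreds of thousands of rows. What I want to do now is take a period of time (for example, one week) and get 100 samples of the running total for each ItemID within that week, so that I can plot a chart without too many data points. I don't know how to do this. I can get 100 samples by filtering with something like "where (rownumber % (rowcount / 100)) = 0", but how do I do that for every ItemID in the table? Do I need to run a separate subquery for each ItemID and then union the results? Thanks.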
Answer 0 (score: 0)
Using standard SQL, you can first collect a sample of 100 timestamps per ItemID by using the LIMIT clause inside the ARRAY_AGG function:
#standardSQL
SELECT ItemID, timestamp, COUNT(*)
OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS running_total
FROM (
-- GROUP BY is required so that ARRAY_AGG builds one array per ItemID
SELECT ItemID, ARRAY_AGG(timestamp LIMIT 100) timestamps
FROM `mydb.purchases`
GROUP BY ItemID) t, UNNEST(t.timestamps) AS timestamp
ORDER BY timestamp
If this does not give you a random sample, you can use RAND() to shuffle the timestamps:
#standardSQL
SELECT ItemID, timestamp, COUNT(*)
OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS running_total
FROM (
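-- ORDER BY RAND() shuffles each item's timestamps before LIMIT keeps 100 of them
SELECT ItemID, ARRAY_AGG(timestamp ORDER BY RAND() LIMIT 100) timestamps
FROM `mydb.purchases`
GROUP BY ItemID) t, UNNEST(t.timestamps) AS timestamp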
ORDER BY timestamp
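One thing worth checking with this approach is the order of sampling and counting. The queries above sample the timestamps first and only then apply COUNT(*) OVER, so the running total counts sampled rows rather than all purchases. The Python sketch below illustrates the difference on made-up data. It is not BigQuery code, and the helper names `running_totals` and `sample_per_item` as well as the data are hypothetical, introduced only for illustration.

```python
import random
from collections import defaultdict


def running_totals(rows):
    """Return (item_id, ts, running_total) for each (item_id, ts) row.

    Mirrors COUNT(*) OVER (PARTITION BY ItemID ORDER BY timestamp).
    """
    counts = defaultdict(int)
    out = []
    for item_id, ts in sorted(rows, key=lambda r: (r[0], r[1])):
        counts[item_id] += 1
        out.append((item_id, ts, counts[item_id]))
    return out


def sample_per_item(rows, n, rng):
    """Pick up to n random rows per item (the first tuple field is the item id)."""
    by_item = defaultdict(list)
    for row in rows:
        by_item[row[0]].append(row)
    sampled = []
    for item_rows in by_item.values():
        sampled.extend(rng.sample(item_rows, min(n, len(item_rows))))
    return sampled


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up data: item A has 1000 purchases, item B has 250.
purchases = [("A", ts) for ts in range(1000)] + [("B", ts) for ts in range(250)]

# Sample first, then count: totals are counted over the sample only.
sample_then_count = running_totals(sample_per_item(purchases, 100, rng))

# Count first, then sample: each kept row carries its true running total.
count_then_sample = sample_per_item(running_totals(purchases), 100, rng)

max_a_sampled_first = max(t for i, _, t in sample_then_count if i == "A")
print(max_a_sampled_first)  # 100, although A really has 1000 purchases
```

If you need the true running totals on the chart, compute the window function over all rows in a subquery and sample its output, rather than sampling first.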
Answer 1 (score: 0)
The query below does what you described, as far as the sampling goes.
I left out the "selecting a week's worth of data" part because it is trivial.
#standardSQL
SELECT
ItemID,
timestamp,
runningtotal
FROM (
SELECT
ItemID,
timestamp,
COUNT(1) OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS runningtotal,
ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS rownumber,
COUNT(1) OVER(PARTITION BY ItemID) AS rowcount
FROM `mydb.purchases`
)
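-- Integer step, clamped to 1 so that items with fewer than 100 rows do not cause a division by zero in MOD
WHERE MOD(rownumber, GREATEST(1, DIV(rowcount, 100))) = 0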
-- ORDER BY ItemID, timestamp
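To see how many rows the modulo filter keeps for items of different sizes, here is a small Python sketch of the arithmetic. It assumes the step is computed with integer division and clamped to at least 1. The function name `kept_rows` and the row counts are hypothetical. In BigQuery, casting a FLOAT64 to INT64 rounds rather than truncates, so a step written as `CAST(rowcount/100 AS INT64)` would be 0 for an item with fewer than 50 rows, and MOD with a zero divisor raises an error; the clamp avoids that.

```python
def kept_rows(rowcount, target=100):
    """Row numbers that pass MOD(rownumber, step) = 0.

    step is rowcount // target, clamped to at least 1 (an assumption of
    this sketch) so that small partitions never produce a zero divisor.
    """
    step = max(1, rowcount // target)
    return [n for n in range(1, rowcount + 1) if n % step == 0]


for rowcount in (30, 100, 150, 1000, 100000):
    print(rowcount, len(kept_rows(rowcount)))
# 30 30
# 100 100
# 150 150
# 1000 100
# 100000 100
```

With this step, an item with at least 100 rows yields between 100 and 199 samples, and a smaller item keeps all of its rows. That is usually close enough for plotting, but it is not an exact 100 per ItemID.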