In my BigQuery table named [games] (about 5 million rows), the rows have the following structure:
user_id game_id game_play_time
1234567 3444432 2017-05-30 15:26:57 UTC
1234567 3444432 2017-05-30 15:26:58 UTC
1234567 3444432 2017-05-30 15:26:59 UTC
9876544 8586588 2017-05-30 23:26:11 UTC
4638889 8698798 2017-05-30 15:26:58 UTC
4638889 8698798 2017-05-30 15:27:58 UTC
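I need to delete rows with the same user_id and game_id where the time difference between consecutive plays is one second or less (keeping the first occurrence).
The result should look like this: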
user_id game_id game_play_time
1234567 3444432 2017-05-30 15:26:57 UTC
9876544 8586588 2017-05-30 23:26:11 UTC
4638889 8698798 2017-05-30 15:26:58 UTC
4638889 8698798 2017-05-30 15:27:58 UTC
Answer 0: (score: 2)
Does this work for you?
SELECT
  user_id,
  game_id,
  game_play_time
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    -- Previous play of the same user and game; NULL for the first play.
    LAG(game_play_time, 1) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time) AS prev_time
  FROM data
)
-- Keep a row only when it starts a new run of plays:
-- either there is no previous play, or the gap to it is more than one second.
WHERE prev_time IS NULL
  OR UNIX_SECONDS(game_play_time) - UNIX_SECONDS(prev_time) > 1
ORDER BY user_id, game_id, game_play_time
where data is your input data, which I defined as:
WITH data AS (
  SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 15:26:57') AS game_play_time UNION ALL
  SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 15:26:58') AS game_play_time UNION ALL
  SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 15:26:59') AS game_play_time UNION ALL
  SELECT '9876544' AS user_id, '8586588' AS game_id, TIMESTAMP('2017-05-30 23:26:11') AS game_play_time UNION ALL
  SELECT '4638889' AS user_id, '8698798' AS game_id, TIMESTAMP('2017-05-30 15:26:58') AS game_play_time UNION ALL
  SELECT '4638889' AS user_id, '8698798' AS game_id, TIMESTAMP('2017-05-30 15:27:58') AS game_play_time
)
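Even though it seems to work here, I am not sure whether there are still cases it does not handle. Running it against your actual data should show whether everything works.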
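One way to follow up on that doubt is to check the output directly. The sketch below assumes the deduplicated rows have been saved to a table called deduped (a hypothetical name, not from the question; replace it with your own dataset-qualified table). It counts the rows that still sit within one second of the previous play for the same user_id and game_id, so a count of 0 would suggest that no close pairs remain.

```sql
#standardSQL
-- `deduped` is a hypothetical table holding the deduplicated rows.
SELECT
  COUNT(*) AS remaining_close_pairs
FROM (
  SELECT
    game_play_time,
    -- Previous surviving play of the same user and game; NULL for the first one.
    LAG(game_play_time) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time) AS prev_time
  FROM deduped
)
-- Rows without a previous play yield NULL here and are not counted.
WHERE TIMESTAMP_DIFF(game_play_time, prev_time, SECOND) <= 1
```

This is only a partial check: it compares rows that survived, so it says nothing about the rows that were removed.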
Answer 1: (score: 1)
Below is for BigQuery Standard SQL:
#standardSQL
SELECT
  user_id,
  game_id,
  MIN(game_play_time) AS game_play_time
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    -- Running total of step: rows of the same run share one grp value.
    SUM(step) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
  FROM (
    SELECT
      user_id,
      game_id,
      game_play_time,
      -- step = 1 when the gap to the previous play exceeds one second, else 0.
      CASE
        WHEN IFNULL(TIMESTAMP_DIFF(
          game_play_time,
          LAG(game_play_time) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time),
          SECOND), 0) > 1 THEN 1
        ELSE 0
      END AS step
    FROM YourTable
  )
)
GROUP BY user_id, game_id, grp
-- ORDER BY user_id, game_id, grp
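To see how the grouping works, it can help to look at the intermediate step and grp columns before the final GROUP BY. The sketch below does that for five rows taken from the question's sample; the inline WITH clause is only there to make it self-contained. The three plays one second apart should all get step 0 and grp 0, while the second play of user 4638889, 60 seconds after the first, should get step 1 and grp 1.

```sql
#standardSQL
-- Inline sample rows copied from the question, so the query runs on its own.
WITH YourTable AS (
  SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 15:26:57') AS game_play_time UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:59') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:27:58')
)
SELECT
  user_id,
  game_id,
  game_play_time,
  step,
  -- Running total of step within each user_id and game_id.
  SUM(step) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    -- 1 when more than one second has passed since the previous play, else 0.
    CASE
      WHEN IFNULL(TIMESTAMP_DIFF(
        game_play_time,
        LAG(game_play_time) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time),
        SECOND), 0) > 1 THEN 1
      ELSE 0
    END AS step
  FROM YourTable
)
ORDER BY user_id, game_id, game_play_time
```

Taking MIN(game_play_time) per grp then keeps the first play of each run.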
You can test it with the dummy data below (the sample from your question plus a few more rows to make it more general):
#standardSQL
WITH YourTable AS (
  SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 12:26:57') AS game_play_time UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 12:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 13:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 13:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:59') UNION ALL
  SELECT '9876544', '8586588', TIMESTAMP('2017-05-30 23:26:11') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:27:58')
)
SELECT
  user_id,
  game_id,
  MIN(game_play_time) AS game_play_time
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    SUM(step) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
  FROM (
    SELECT
      user_id,
      game_id,
      game_play_time,
      CASE
        WHEN IFNULL(TIMESTAMP_DIFF(
          game_play_time,
          LAG(game_play_time) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time),
          SECOND), 0) > 1 THEN 1
        ELSE 0
      END AS step
    FROM YourTable
  )
)
GROUP BY user_id, game_id, grp
-- ORDER BY user_id, game_id, grp
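The question asks to delete rows, while the query above only selects the rows to keep. One possible way to persist the result, sketched here, is to write it to a new table with CREATE OR REPLACE TABLE ... AS. The names mydataset.games and mydataset.games_dedup are assumptions made for illustration, not taken from the question.

```sql
#standardSQL
-- mydataset.games and mydataset.games_dedup are hypothetical names; adjust to your project.
CREATE OR REPLACE TABLE mydataset.games_dedup AS
SELECT
  user_id,
  game_id,
  MIN(game_play_time) AS game_play_time
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    SUM(step) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
  FROM (
    SELECT
      user_id,
      game_id,
      game_play_time,
      CASE
        WHEN IFNULL(TIMESTAMP_DIFF(
          game_play_time,
          LAG(game_play_time) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time),
          SECOND), 0) > 1 THEN 1
        ELSE 0
      END AS step
    FROM mydataset.games
  )
)
GROUP BY user_id, game_id, grp
```

Writing to a separate table leaves the original untouched, so the two can be compared before anything is dropped. Note that this sketch carries over only the three columns shown in the question; any other columns in the real table would need to be added to the SELECT.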