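Removing duplicates in a BigQuery table that are within seconds of each other

Time: 2017-05-30 14:59:46

Tags: sql google-bigquery

I have a BigQuery table named [games] (about 5 million rows) whose rows have the following structure: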

user_id    game_id   game_play_time
1234567    3444432   2017-05-30 15:26:57 UTC
1234567    3444432   2017-05-30 15:26:58 UTC
1234567    3444432   2017-05-30 15:26:59 UTC
9876544    8586588   2017-05-30 23:26:11 UTC
4638889    8698798   2017-05-30 15:26:58 UTC
4638889    8698798   2017-05-30 15:27:58 UTC 

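I need to remove rows with the same user_id and game_id where the time difference between consecutive games is one second or less (keeping the first occurrence).

The result should look like this: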

user_id    game_id   game_play_time
1234567    3444432   2017-05-30 15:26:57 UTC
9876544    8586588   2017-05-30 23:26:11 UTC
4638889    8698798   2017-05-30 15:26:58 UTC
4638889    8698798   2017-05-30 15:27:58 UTC 

2 answers:

Answer 0 (score: 2)

Does this work for you?

SELECT
  user_id,
  game_id,
  MIN(game_play_time) game_play_time
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    lead_time,
    -- TRUE when the next play is at most 1 second later;
    -- NULL when there is no next row in the partition
    (UNIX_SECONDS(lead_time) - UNIX_SECONDS(game_play_time) <= 1) diff
  FROM (
    SELECT
      user_id,
      game_id,
      game_play_time,
      -- timestamp of the next play for the same user and game
      LEAD(game_play_time, 1) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time) lead_time
    FROM data
  )
)
GROUP BY user_id, game_id, diff
ORDER BY user_id, game_id, game_play_time

Here data is your input data, which I defined as:

WITH data AS(
select '1234567' as user_id, '3444432' as game_id, timestamp('2017-05-30 15:26:57') game_play_time union all
select '1234567' as user_id, '3444432' as game_id, timestamp('2017-05-30 15:26:58') game_play_time union all
select '1234567' as user_id, '3444432' as game_id, timestamp('2017-05-30 15:26:59') game_play_time union all
select '9876544' as user_id, '8586588' as game_id, timestamp('2017-05-30 23:26:11') game_play_time union all
select '4638889' as user_id, '8698798' as game_id, timestamp('2017-05-30 15:26:58') game_play_time union all
select '4638889' as user_id, '8698798' as game_id, timestamp('2017-05-30 15:27:58') game_play_time
)

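Even though it seems to work here, I am not sure whether there are still cases it does not handle. Running it against your real data should show whether everything is OK.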
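One case worth checking before relying on the query above: LEAD returns NULL for the last row of each (user_id, game_id) partition, so diff is NULL there and GROUP BY ... diff puts that row in a group of its own. With the sample data this would return 15:26:59 for user 1234567 in addition to 15:26:57. A minimal alternative sketch, assuming the goal is to keep only rows that do not follow another row within one second, compares each row with the previous one and filters instead of grouping (prev_time is a name introduced here for illustration, not part of the original answer):

```sql
#standardSQL
-- Sketch under the assumptions stated above; the CTE repeats the sample data.
WITH data AS (
  SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 15:26:57') AS game_play_time UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:59') UNION ALL
  SELECT '9876544', '8586588', TIMESTAMP('2017-05-30 23:26:11') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:27:58')
)
SELECT
  user_id,
  game_id,
  game_play_time
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    -- timestamp of the previous play for the same user and game (NULL for the first one)
    LAG(game_play_time) OVER (PARTITION BY user_id, game_id ORDER BY game_play_time) AS prev_time
  FROM data
)
-- keep the first play of each partition and any play that starts
-- more than 1 second after the one before it
WHERE prev_time IS NULL
   OR TIMESTAMP_DIFF(game_play_time, prev_time, SECOND) > 1
ORDER BY user_id, game_id, game_play_time
```

On the sample data this keeps the four rows listed as the expected result in the question. Because each row is compared only with its immediate predecessor, a chain such as 57, 58, 59 collapses to its first row.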

Answer 1 (score: 1)

The following is for BigQuery Standard SQL:

  
#standardSQL
SELECT 
  user_id,
  game_id,
  MIN(game_play_time) AS game_play_time
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    SUM(step) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
  FROM (
    SELECT 
      user_id,
      game_id,
      game_play_time,
      CASE
        WHEN IFNULL(TIMESTAMP_DIFF(
               game_play_time,
               LAG(game_play_time) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time),
               SECOND), 0) > 1
        THEN 1
        ELSE 0
      END AS step
    FROM YourTable
  )
)
GROUP BY user_id, game_id, grp
--  ORDER BY user_id, game_id, grp

You can test it with the dummy data below (the sample from your question plus a few more rows to make it more general):

#standardSQL
WITH YourTable AS(
  SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 12:26:57') game_play_time UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 12:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 13:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 13:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:57') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:59') UNION ALL
  SELECT '9876544', '8586588', TIMESTAMP('2017-05-30 23:26:11') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:27:58')
)
SELECT 
  user_id,
  game_id,
  MIN(game_play_time) AS game_play_time
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    SUM(step) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
  FROM (
    SELECT 
      user_id,
      game_id,
      game_play_time,
      CASE
        WHEN IFNULL(TIMESTAMP_DIFF(
               game_play_time,
               LAG(game_play_time) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time),
               SECOND), 0) > 1
        THEN 1
        ELSE 0
      END AS step
    FROM YourTable
  )
)
GROUP BY user_id, game_id, grp
-- ORDER BY user_id, game_id, grp
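
To see why the query above collapses each burst of plays into a single row, it can help to look at the intermediate step and grp columns before the final GROUP BY. The sketch below runs only the two inner levels on a reduced copy of the question's sample; nothing here is new logic, it just exposes the intermediate values.

```sql
#standardSQL
-- Illustration only: shows the helper columns used by the grouping query.
WITH YourTable AS (
  SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 15:26:57') AS game_play_time UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:59') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
  SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:27:58')
)
SELECT
  user_id,
  game_id,
  game_play_time,
  step,
  -- running total of step: increases by 1 every time a new burst starts
  SUM(step) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
FROM (
  SELECT
    user_id,
    game_id,
    game_play_time,
    -- 1 when this play starts more than 1 second after the previous one, else 0
    CASE
      WHEN IFNULL(TIMESTAMP_DIFF(
             game_play_time,
             LAG(game_play_time) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time),
             SECOND), 0) > 1
      THEN 1
      ELSE 0
    END AS step
  FROM YourTable
)
ORDER BY user_id, game_id, game_play_time
```

For user 1234567 all three rows get step 0 and therefore grp 0, so MIN(game_play_time) returns 15:26:57 for that group. For user 4638889 the second row is 60 seconds after the first, so it gets step 1 and grp 1, and both rows survive as separate groups.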