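Google BigQuery: retrieving the latest version of each row

Date: 2017-07-10 14:33:33

Tags: performance google-bigquery version

I have a Google BigQuery table that contains every version of each resource. Each time a resource is created, updated, or deleted, a new row is added with an incremented version number (this number will be the timestamp at which the row was added).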

+-------+------------+--------+-------+-------------+
|  ID   | ResourceID | Action | Count |  Timestamp  |
+-------+------------+--------+-------+-------------+
| ABC_1 | ABC        | CREATE |    10 | {timestamp} |
| ABC_2 | ABC        | UPDATE |     8 | {timestamp} |
| ABC_3 | ABC        | UPDATE |     4 | {timestamp} |
| ABC_4 | ABC        | DELETE |     4 | {timestamp} |
| -     |            |        |       |             |
| DEF_1 | DEF        | CREATE |    10 | {timestamp} |
| DEF_2 | DEF        | DELETE |    10 | {timestamp} |
| -     |            |        |       |             |
| GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
| -     |            |        |       |             |
| KLM_1 | KLM        | CREATE |    10 | {timestamp} |
| KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
+-------+------------+--------+-------+-------------+
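  • ID: the unique ID of the row, made up of the ResourceID and a version identifier
  • ResourceID: the ID of the resource the action was performed on
  • Action: the action performed on the resource
  • Count: a value associated with the resource
  • Timestamp: the time at which the row was added (the same value used in the unique ID)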

I need to write a query that retrieves the latest version of each resource:

+-------+------------+--------+-------+-------------+
|  ID   | ResourceID | Action | Count |  Timestamp  |
+-------+------------+--------+-------+-------------+
| ABC_4 | ABC        | DELETE |     4 | {timestamp} |
| DEF_2 | DEF        | DELETE |    10 | {timestamp} |
| GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
| KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
+-------+------------+--------+-------+-------------+

In addition, every resource whose latest action is DELETE must be ignored, so this is the final output I am looking for:

+-------+------------+--------+-------+-------------+
|  ID   | ResourceID | Action | Count |  Timestamp  |
+-------+------------+--------+-------+-------------+
| GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
| KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
+-------+------------+--------+-------+-------------+

This is the query I came up with:

SELECT ResourceID, Count
FROM `worklog_*`
WHERE ID IN (
    -- Highest ID per resource, i.e. its latest version
    SELECT MAX(ID)
    FROM `worklog_*`
    GROUP BY ResourceID
) AND Action != 'DELETE'

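This is not a real BigQuery query, but it is enough to show the behavior. The query works as long as the values of the ID column can be compared, which is why I chose to build the ID from ResourceID and Timestamp: MAX() will then always return the last state.

Is this the best approach? Can anyone suggest a better way to do this kind of extraction?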
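One caveat about comparing IDs, shown as a minimal Python sketch: if the version suffix is not fixed-width, a string maximum can pick the wrong row. Python's string ordering stands in here for BigQuery's STRING ordering (an assumption: no custom collation is in use), and `version_of` and the sample IDs are hypothetical, not part of the original schema.

```python
def version_of(row_id):
    """Return the numeric suffix after the last underscore, e.g. 'ABC_10' -> 10."""
    return int(row_id.rsplit("_", 1)[1])


# Hypothetical IDs for one resource whose version suffixes differ in length.
ids = ["ABC_2", "ABC_9", "ABC_10"]

# Plain string comparison goes character by character, so '9' beats '1'.
latest_by_string = max(ids)

# Comparing the numeric suffix picks the row that is actually newest.
latest_by_version = max(ids, key=version_of)

# Zero-padding the suffix to a fixed width makes string order match numeric order.
padded_ids = ["ABC_%010d" % version_of(i) for i in ids]
latest_padded = max(padded_ids)

print(latest_by_string)   # ABC_9
print(latest_by_version)  # ABC_10
print(latest_padded)      # ABC_0000000010
```

Timestamps written with a fixed number of digits avoid the problem. Otherwise, ordering by the Timestamp column itself is likely the safer choice.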

1 answer:

Answer 0: (score: 4)

For BigQuery Standard SQL:

  
#standardSQL
WITH worklog AS (
  SELECT 'ABC_1' AS ID, 'ABC' AS ResourceID, 'CREATE' AS Action, 10 AS COUNT UNION ALL
  SELECT 'ABC_2', 'ABC', 'UPDATE', 8 UNION ALL
  SELECT 'ABC_3', 'ABC', 'UPDATE', 4 UNION ALL
  SELECT 'ABC_4', 'ABC', 'DELETE', 4 UNION ALL
  SELECT 'DEF_1', 'DEF', 'CREATE', 10 UNION ALL
  SELECT 'DEF_2', 'DEF', 'DELETE', 10 UNION ALL
  SELECT 'GHJ_1', 'GHJ', 'CREATE', 10 UNION ALL
  SELECT 'KLM_1', 'KLM', 'CREATE', 10 UNION ALL
  SELECT 'KLM_2', 'KLM', 'UPDATE', 5 
)
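SELECT * EXCEPT(Last)
FROM (
  SELECT *,
    -- Rank each resource's rows from newest (1) to oldest
    ROW_NUMBER() OVER(PARTITION BY ResourceID ORDER BY ID DESC) AS Last
  FROM worklog
)
-- Filter on Action only after ranking, so that a resource whose
-- latest row is DELETE is dropped instead of falling back to an older row
WHERE Last = 1
  AND Action != 'DELETE'
-- ORDER BY ID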
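Where the `Action != 'DELETE'` filter sits relative to the ranking changes the result. The sketch below runs the same ROW_NUMBER pattern on the sample rows with the filter in both positions. It uses Python's built-in `sqlite3` module as a local stand-in for BigQuery (an assumption; SQLite 3.25 or later is needed for window functions), and `latest_versions` is a hypothetical helper written for this illustration.

```python
import sqlite3

# Sample rows from the question (Timestamp omitted, as in the answer's CTE).
ROWS = [
    ("ABC_1", "ABC", "CREATE", 10),
    ("ABC_2", "ABC", "UPDATE", 8),
    ("ABC_3", "ABC", "UPDATE", 4),
    ("ABC_4", "ABC", "DELETE", 4),
    ("DEF_1", "DEF", "CREATE", 10),
    ("DEF_2", "DEF", "DELETE", 10),
    ("GHJ_1", "GHJ", "CREATE", 10),
    ("KLM_1", "KLM", "CREATE", 10),
    ("KLM_2", "KLM", "UPDATE", 5),
]


def latest_versions(filter_before_ranking):
    """Return the IDs kept when DELETE rows are filtered before or after ranking."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE worklog (ID TEXT, ResourceID TEXT, Action TEXT, "Count" INTEGER)'
    )
    conn.executemany("INSERT INTO worklog VALUES (?, ?, ?, ?)", ROWS)
    inner = "WHERE Action != 'DELETE'" if filter_before_ranking else ""
    outer = "" if filter_before_ranking else "AND Action != 'DELETE'"
    sql = f"""
        SELECT ID FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY ResourceID ORDER BY ID DESC) AS rn
            FROM worklog
            {inner}
        )
        WHERE rn = 1 {outer}
        ORDER BY ID
    """
    ids = [row[0] for row in conn.execute(sql)]
    conn.close()
    return ids


# Filtering first ranks only non-DELETE rows, so deleted resources fall back
# to their last non-DELETE version.
print(latest_versions(filter_before_ranking=True))   # ['ABC_3', 'DEF_1', 'GHJ_1', 'KLM_2']

# Filtering after ranking drops every resource whose latest row is a DELETE.
print(latest_versions(filter_before_ranking=False))  # ['GHJ_1', 'KLM_2']
```

Only the second variant matches the final output requested in the question, which keeps GHJ_1 and KLM_2 and leaves out ABC and DEF.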