Removing duplicates with a condition

Asked: 2017-12-08 19:57:21

Tags: google-bigquery distinct

I track user sessions on my system and store them in the format below. Sometimes I get multiple records for the same session ID.

Row session_id                              user_actions     
1   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141  
2   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141,142,143,146  
3   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117,118,141,142,143,146  
4   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117  

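I used to run a SQL query with DISTINCT(session_id) so that only one of the multiple records was kept for each session ID. But I just realized that my query picks the top row even when a lower row has recorded more actions for the same session. So for the table above, my query keeps rows 1 and 3, like this: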

Row session_id                              user_actions     
1   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141  
3   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117,118,141,142,143,146  

However, I want to keep rows 2 and 3, like this:

Row session_id                              user_actions     
2   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141,142,143,146  
3   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117,118,141,142,143,146  

Is there any way to do this with a SQL query? Thanks!

1 answer:

Answer 0: (score: 2)

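Below is one option for BigQuery Standard SQL. It ranks the rows of each session by the number of actions they contain and keeps only the row with the most:
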
#standardSQL
SELECT row, session_id, user_actions
FROM (
  SELECT 
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id 
      ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
    ) = 1 win
  FROM `project.dataset.table`
)
WHERE win

You can test and play with the above using the dummy data from your question, as below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117' 
)
SELECT row, session_id, user_actions
FROM (
  SELECT 
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id 
      ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
    ) = 1 win
  FROM `project.dataset.table`
)
WHERE win
ORDER BY row  

The result is:

row session_id                              user_actions     
2   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141,142,143,146  
3   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117,118,141,142,143,146  

Another option is below. It returns only session_id and user_actions, without the row column:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117' 
)
SELECT session_id, 
  ARRAY_AGG(user_actions ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC LIMIT 1)[SAFE_OFFSET(0)] user_actions
FROM `project.dataset.table`
GROUP BY session_id   

This one looks cleaner :o)

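You can extend the above if, for example, some actions are missing from one row but present in another, e.g. by combining the distinct codes from the duplicate entries into a single deduplicated entry.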
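As a rough sketch of that extension (not part of the original answer), the query below splits every row into individual action codes and re-aggregates the distinct codes per session. The output column name `merged_actions` is an assumption chosen for illustration, and the dummy data is the same as in the question.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT
  session_id,
  -- Join the distinct codes of the session back into one comma-separated string.
  -- merged_actions is a hypothetical column name, not from the original answer.
  STRING_AGG(DISTINCT action, ',' ORDER BY action) AS merged_actions
FROM `project.dataset.table`,
  UNNEST(SPLIT(user_actions)) AS action  -- one output row per action code
GROUP BY session_id
```

Note that this sorts the codes as strings, so the original order of actions within a session is not preserved. If the order matters, this approach would need a different design, for example keeping the longest row as in the queries above.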

  

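Update:

Try the following, which computes ARRAY_LENGTH once per row in an inner query so that its cost is separated from the sorting within each partition: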

#standardSQL
SELECT row, session_id, user_actions
FROM (
  SELECT 
    row, session_id, user_actions, 
    ROW_NUMBER() OVER(PARTITION BY session_id ORDER BY len DESC) = 1 win
  FROM (
    SELECT *, ARRAY_LENGTH(SPLIT(user_actions)) len
    FROM `project.dataset.table`
  )
)
WHERE win