I track user sessions on our system and store them in the following format. Sometimes I get multiple records for the same session ID.
Row session_id user_actions
1 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
4 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117
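I had been running a SQL query with DISTINCT(session_id) to keep only one of the multiple records for each session ID. But I just realized that my query picks the top row even when a lower row has recorded more actions for the same session. So for the table above, my query keeps rows 1 and 3, like this: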
Row session_id user_actions
1 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
However, I want to keep rows 2 and 3, like this:
Row session_id user_actions
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Is there a way to do this with a SQL query? Thanks!
Answer 0 (score: 2)
Below is one option for BigQuery Standard SQL:
#standardSQL
SELECT row, session_id, user_actions
FROM (
  SELECT
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id
      ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
    ) = 1 win
  FROM `project.dataset.table`
)
WHERE win
You can test and play with the above using the dummy data from your question, as follows:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT row, session_id, user_actions
FROM (
  SELECT
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id
      ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
    ) = 1 win
  FROM `project.dataset.table`
)
WHERE win
ORDER BY row
The result is:
row session_id user_actions
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Another option is as follows:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT session_id,
  ARRAY_AGG(user_actions ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC LIMIT 1)[SAFE_OFFSET(0)] user_actions
FROM `project.dataset.table`
GROUP BY session_id
This one looks cleaner :o)
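You can extend the above if, for example, some actions are missing from one row but present in another: combine the distinct codes from all rows of a session into the deduplicated entry.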
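As a rough sketch of that extension (not something the queries above already do), one could flatten every row's action list and re-aggregate the distinct codes per session. This assumes the codes are comma-separated with no surrounding whitespace and that the order of codes in the merged string does not matter; the alias `action` and the alphabetical ordering are illustrative choices only, and the dummy rows are the ones from the question:

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT
  t.session_id,
  -- DISTINCT drops codes repeated across rows of the same session;
  -- ORDER BY (on the same expression) only makes the output deterministic
  STRING_AGG(DISTINCT action, ',' ORDER BY action) AS user_actions
FROM `project.dataset.table` t,
  UNNEST(SPLIT(t.user_actions)) AS action  -- one output row per action code
GROUP BY t.session_id
```

Note that the codes are sorted as strings here, so the original sequence of actions within a session is not preserved. If that sequence matters, keeping the longest row as in the queries above is probably the safer choice.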
Update:
Try the following, which separates the cost of computing ARRAY_LENGTH from the sorting within each partition:
#standardSQL
SELECT row, session_id, user_actions
FROM (
  SELECT
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id ORDER BY len DESC) = 1 win
  FROM (
    SELECT *, ARRAY_LENGTH(SPLIT(user_actions)) len
    FROM `project.dataset.table`
  )
)
WHERE win