My table looks like this:
author | group
daniel | group1,group2,group3,group4,group5,group8,group10
adam | group2,group5,group11,group12
harry | group1,group10,group15,group13,group15,group18
...
...
I would like my output to look like this:
author1 | author2 | intersection | union
daniel | adam | 2 | 9
daniel | harry | 2 | 11
adam | harry | 0 | 10
Thank you
Answer 0 (score: 1)
Try the following (for BigQuery Legacy SQL):
SELECT
a.author AS author1,
b.author AS author2,
SUM(a.item=b.item) AS intersection,
EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author
GROUP BY 1,2
Added a solution for BigQuery Standard SQL:
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, SPLIT(grp) AS grp
FROM YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
(SELECT COUNT(1) FROM a.grp) AS count1,
(SELECT COUNT(1) FROM b.grp) AS count2,
(SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
(SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author
I like this one. If you try it, be sure to uncheck the Use Legacy SQL checkbox under Show Options.
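As a quick cross-check outside SQL, the following is a minimal Python sketch (not part of the original answer) that computes the same pairwise intersection and union counts with plain sets. The sample rows are copied from the YourTable CTE above; the function name pair_counts is made up for illustration.

```python
from itertools import combinations

# Sample rows copied from the Standard SQL answer's YourTable CTE.
rows = {
    "daniel": "group1,group2,group3,group4,group5,group8,group10",
    "adam": "group2,group5,group11,group12",
    "harry": "group1,group10,group13,group15,group18",
}


def pair_counts(rows):
    """Return {(author1, author2): (intersection, union)} with author1 < author2."""
    # Equivalent of SPLIT(grp): one set of group names per author.
    groups = {author: set(csv.split(",")) for author, csv in rows.items()}
    result = {}
    # Sorting first gives each unordered pair once, like ON a.author < b.author.
    for a, b in combinations(sorted(groups), 2):
        result[(a, b)] = (len(groups[a] & groups[b]), len(groups[a] | groups[b]))
    return result


for (a, b), (inter, union) in pair_counts(rows).items():
    print(a, b, inter, union)
```

With this sample data, which lists group15 only once for harry, the daniel/harry union comes out as 10 and the adam/harry union as 9.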
Answer 1 (score: 0)
Inspired by Mikhail Berlyant's second answer, here is essentially the same approach reformatted for Presto (as an example of another SQL flavor). Again, all credit goes to Mikhail.
WITH
YourTable AS (
SELECT
'daniel' AS author,
'group1,group2,group3,group4,group5,group8,group10' AS grp
UNION ALL
SELECT
'adam' AS author,
'group2,group5,group11,group12' AS grp
UNION ALL
SELECT
'harry' AS author,
'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT
author,
SPLIT(grp, ',') AS grp
FROM
YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
CARDINALITY(a.grp) AS count1,
CARDINALITY(b.grp) AS count2,
CARDINALITY(ARRAY_INTERSECT(a.grp, b.grp)) AS intersection_count,
CARDINALITY(ARRAY_UNION(a.grp, b.grp)) AS union_count
FROM tempTable a
JOIN tempTable b ON a.author < b.author
;
Note that the counts can differ slightly from the expected output in the question, because only unique entries are counted. For example, harry's row in the question lists group15 twice, but only one of them is counted in union_count.
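To illustrate why repeated entries change the counts, here is a minimal Python sketch (an illustration only, not part of the original answer). It uses harry's row exactly as written in the question, where group15 appears twice; the function name union_counts is made up for this example.

```python
# Rows as written in the question; harry lists group15 twice.
daniel = "group1,group2,group3,group4,group5,group8,group10".split(",")
harry = "group1,group10,group15,group13,group15,group18".split(",")


def union_counts(a, b):
    """Return (raw_union, distinct_union) for two lists of group names."""
    inter = len(set(a) & set(b))
    # Counting list lengths keeps repeated entries in the total.
    raw_union = len(a) + len(b) - inter
    # A set union counts every group name once.
    distinct_union = len(set(a) | set(b))
    return raw_union, distinct_union


print(union_counts(daniel, harry))
```

The first number keeps the repeated group15 and the second does not, which accounts for the difference between 11 and 10 for this pair.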
Answer 2 (score: 0)
I suggest this option, which scales better:
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, grp
FROM YourTable, UNNEST(SPLIT(grp)) as grp
),
intersection AS (
SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
FROM tempTable a
JOIN tempTable b
USING (grp)
WHERE a.author > b.author
GROUP BY a.author, b.author
),
count_distinct_groups AS (
SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
FROM tempTable
GROUP BY author
),
join_it AS (
SELECT
intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
FROM
intersection
JOIN
count_distinct_groups cg1
ON
intersection.author1 = cg1.author
JOIN
count_distinct_groups cg2
ON
intersection.author2 = cg2.author
)
SELECT
*,
count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
join_it
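On big data (tens of thousands x millions), the full cross join fails because of excessive shuffling, and the second suggestion takes hours to run. This one takes a few minutes.
A consequence of this approach is that pairs with no intersection do not appear in the result, so handling them (for example with IFNULL) is left to the process that consumes it.
One last detail: the union of daniel and harry is 10, not 11, because group15 is repeated in the original example.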
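The point about missing pairs can be sketched in Python (an illustration only, assuming the consumer has the per-author distinct counts and the sparse intersection counts available). The input dictionaries below are hypothetical stand-ins for the count_distinct_groups and intersection CTEs, and complete_pairs is a made-up name.

```python
from itertools import combinations

# Hypothetical stand-in for the count_distinct_groups CTE.
distinct_counts = {"adam": 4, "daniel": 7, "harry": 5}
# Hypothetical stand-in for the intersection CTE: only pairs that share
# at least one group are present, as with the inner join on grp.
intersections = {("daniel", "adam"): 2, ("harry", "daniel"): 2}


def complete_pairs(distinct_counts, intersections):
    """Return (author1, author2, intersection, union, jaccard) for every pair."""
    rows = []
    # Reverse sorting yields author1 > author2, matching WHERE a.author > b.author.
    for a, b in combinations(sorted(distinct_counts, reverse=True), 2):
        # Missing pairs default to 0, the role IFNULL would play in SQL.
        inter = intersections.get((a, b), 0)
        union = distinct_counts[a] + distinct_counts[b] - inter
        rows.append((a, b, inter, union, inter / union))
    return rows


for row in complete_pairs(distinct_counts, intersections):
    print(row)
```

Here the harry/adam pair, which the join-based query would omit, is reported with an intersection of 0 and a Jaccard value of 0.0.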