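I'm trying to query the Google BigQuery public Reddit dataset. My goal is to compute the similarity of subreddits using the Jaccard index, defined as the size of the intersection of two sets divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.
My plan is to select the top N = 1000 subreddits by number of comments in August 2016, then compute their Cartesian product to get every combination of subreddits in the form subreddit1, subreddit2.
I would then use those combination rows to query the union of users between subreddit1 and subreddit2, as well as their intersection.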
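To make the plan above concrete, here is a minimal sketch in plain Python on made-up data. The `comments` list, the `jaccard_by_pair` helper and the tiny `top_n` are hypothetical stand-ins for the real table and N = 1000; the sketch only illustrates which quantities are needed per pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: one (author, subreddit) tuple per comment.
comments = [
    ("alice", "Art"), ("alice", "Politics"), ("bob", "Art"),
    ("bob", "Art"), ("carol", "Politics"), ("carol", "Science"),
    ("dave", "Science"), ("alice", "Science"),
]


def jaccard_by_pair(comments, top_n):
    """Return {(subreddit1, subreddit2): (union, intersection, jaccard)}."""
    # Rank subreddits by comment count and keep the top_n of them.
    counts = Counter(sub for _, sub in comments)
    top = sorted(sub for sub, _ in counts.most_common(top_n))

    # Collect the distinct authors of each kept subreddit.
    authors = {sub: set() for sub in top}
    for author, sub in comments:
        if sub in authors:
            authors[sub].add(author)

    result = {}
    # combinations() over the sorted names yields each unordered pair once,
    # which mirrors a "subreddit1 < subreddit2" filter on a cross join.
    for s1, s2 in combinations(top, 2):
        union = len(authors[s1] | authors[s2])
        inter = len(authors[s1] & authors[s2])
        result[(s1, s2)] = (union, inter, inter / union)
    return result


pairs = jaccard_by_pair(comments, top_n=3)
for (s1, s2), (union, inter, j) in pairs.items():
    print(s1, s2, union, inter, round(j, 3))
```

Each pair needs only two numbers, the size of the union and the size of the intersection, so the SQL problem reduces to producing those two counts per row.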
The query I have so far is:
SELECT
  subreddit1,
  subreddit2,
  -- Distinct authors who commented in either subreddit
  (
    SELECT COUNT(DISTINCT author)
    FROM `fh-bigquery.reddit_comments.2016_08`
    WHERE subreddit = subreddit1
      OR subreddit = subreddit2
    LIMIT 1
  ) AS subreddits_union,
  -- Distinct authors who commented in both subreddits
  (
    SELECT COUNT(DISTINCT author)
    FROM `fh-bigquery.reddit_comments.2016_08`
    WHERE subreddit = subreddit1
      AND author IN (
        SELECT author
        FROM `fh-bigquery.reddit_comments.2016_08`
        WHERE subreddit = subreddit2
        GROUP BY author
      )
  ) AS subreddits_intersection
FROM (
  -- Every unordered pair of the top 1000 subreddits by comment count
  SELECT a.subreddit AS subreddit1, b.subreddit AS subreddit2
  FROM (
    SELECT subreddit, COUNT(*) AS n_comments
    FROM `fh-bigquery.reddit_comments.2016_08`
    GROUP BY subreddit
    ORDER BY n_comments DESC
    LIMIT 1000
  ) a
  CROSS JOIN (
    SELECT subreddit, COUNT(*) AS n_comments
    FROM `fh-bigquery.reddit_comments.2016_08`
    GROUP BY subreddit
    ORDER BY n_comments DESC
    LIMIT 1000
  ) b
  WHERE a.subreddit < b.subreddit
)
Ideally this would give results like:
subreddit1 | subreddit2 | subreddits_union | subreddits_intersection
---------------------------------------------------------------------
Art        | Politics   | 50000            | 21000
Art        | Science    | 92320            | 15000
...        | ...        | ...              | ...
However, this query produces the following BigQuery error:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
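I understand the error, but I don't think this query can be transformed into an efficient join. Given that BQ has no apply method, is there a way to set up this query without resorting to individual queries? Perhaps using PARTITION BY?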
Answer 0 (score: 2)
Thanks for your answer. It works well for returning the subreddit union, but how would you implement the intersection?
Perhaps something along the lines of:
WITH top_most AS (
  SELECT subreddit, COUNT(*) AS n_comments
  FROM `fh-bigquery.reddit_comments.2016_08`
  GROUP BY subreddit
  ORDER BY n_comments DESC
  LIMIT 20
),
authors AS (
  -- One row per (author, subreddit) combination
  SELECT DISTINCT author, subreddit
  FROM `fh-bigquery.reddit_comments.2016_08`
)
SELECT
  COUNT(DISTINCT a1.author),
  subreddit1, subreddit2
FROM
  (
    SELECT t1.subreddit subreddit1, t2.subreddit subreddit2
    FROM top_most t1 CROSS JOIN top_most t2 LIMIT 1000000
  )
  INNER JOIN authors a1 ON a1.subreddit = subreddit1
  INNER JOIN authors a2 ON a2.subreddit = subreddit2
-- Keep only authors who appear in both subreddits of the pair
WHERE a1.author = a2.author
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2
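Once the intersection is available from a self-join on author, the union does not need a separate scan: by inclusion-exclusion, |A ∪ B| = |A| + |B| - |A ∩ B|. The following is a minimal sketch of that idea, run against an in-memory SQLite database through Python's `sqlite3` module so it can be tried locally. The `comments` table, its toy rows, and the CTE names `sizes` and `intersections` are assumptions for illustration; the SQL has not been run on BigQuery and would need the real table name and possibly dialect adjustments there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the Reddit comments table: one row per comment.
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("alice", "Art"), ("alice", "Politics"), ("bob", "Art"),
        ("bob", "Art"), ("carol", "Politics"), ("carol", "Science"),
        ("dave", "Science"), ("alice", "Science"), ("erin", "Music"),
    ],
)

query = """
WITH top_most AS (
  SELECT subreddit, COUNT(*) AS n_comments
  FROM comments
  GROUP BY subreddit
  ORDER BY n_comments DESC
  LIMIT 3
),
authors AS (
  -- Distinct (author, subreddit) rows, restricted to the top subreddits
  SELECT DISTINCT c.author, c.subreddit
  FROM comments AS c
  JOIN top_most AS t ON t.subreddit = c.subreddit
),
sizes AS (
  -- |A|: number of distinct authors per subreddit
  SELECT subreddit, COUNT(*) AS n_authors
  FROM authors
  GROUP BY subreddit
),
intersections AS (
  -- |A n B|: authors shared by each unordered pair of subreddits
  SELECT a1.subreddit AS subreddit1,
         a2.subreddit AS subreddit2,
         COUNT(*) AS n_common
  FROM authors AS a1
  JOIN authors AS a2
    ON a1.author = a2.author AND a1.subreddit < a2.subreddit
  GROUP BY a1.subreddit, a2.subreddit
)
SELECT i.subreddit1,
       i.subreddit2,
       s1.n_authors + s2.n_authors - i.n_common AS subreddits_union,
       i.n_common AS subreddits_intersection,
       1.0 * i.n_common
         / (s1.n_authors + s2.n_authors - i.n_common) AS jaccard
FROM intersections AS i
JOIN sizes AS s1 ON s1.subreddit = i.subreddit1
JOIN sizes AS s2 ON s2.subreddit = i.subreddit2
ORDER BY i.subreddit1, i.subreddit2
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the pairs come from an inner join on author, pairs with no shared authors do not appear in the output at all; their Jaccard index is 0, so they would have to be added back from the list of pairs if they are needed.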
Answer 1 (score: 1)
I'm not sure I fully understand what you are trying to compute, but perhaps this example can help you arrive at a solution:
SELECT
  subreddit1,
  subreddit2,
  COUNT(DISTINCT author)
FROM
  `fh-bigquery.reddit_comments.2016_08` AS f
CROSS JOIN
  (
    SELECT a.subreddit AS subreddit1, b.subreddit AS subreddit2
    FROM (
      SELECT subreddit, COUNT(*) AS n_comments
      FROM `fh-bigquery.reddit_comments.2016_08`
      GROUP BY subreddit
      ORDER BY n_comments DESC
      LIMIT 10
    ) a
    CROSS JOIN (
      SELECT subreddit, COUNT(*) AS n_comments
      FROM `fh-bigquery.reddit_comments.2016_08`
      GROUP BY subreddit
      ORDER BY n_comments DESC
      LIMIT 10
    ) b
    WHERE a.subreddit < b.subreddit
    LIMIT 1000000
  )
WHERE f.subreddit = subreddit1 OR f.subreddit = subreddit2
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2