SQL REDDIT - Jaccard similarity

Date: 2016-04-21 17:56:22

Tags: sql comments google-bigquery reddit

I'm trying to implement a fancy SQL query, but I'm having trouble with the join and the counting.

I have a data table in long format:

author | group  | id

daniel | group1 | 118
adam   | group2 | 126
harry  | group1 | 221
daniel | group2 | 323
daniel | group2 | 122
daniel | group5 | 322
harry  | group1 | 222
harry  | group1 | 225

... ...

I'd like my output to look like:

author1 | author2 | intersection | union

daniel | adam | 2 | 3
daniel | harry| 2 | 11
adam   | harry| 0 | 10

where intersection is defined as the number of groups that author1 and author2 have in common, and union = (# of groups of author1) + (# of groups of author2) - intersection.
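To make the definition concrete, here is a minimal Python sketch that applies it to the eight sample rows shown above. The function name `pairwise_counts` is made up for illustration, and the Jaccard similarity is taken to be intersection / union. The numbers in the desired output are only illustrative of a larger table, so the counts here differ from them.

```python
from collections import defaultdict
from itertools import combinations

# The eight sample rows from the question: (author, group, id).
rows = [
    ("daniel", "group1", 118),
    ("adam", "group2", 126),
    ("harry", "group1", 221),
    ("daniel", "group2", 323),
    ("daniel", "group2", 122),
    ("daniel", "group5", 322),
    ("harry", "group1", 222),
    ("harry", "group1", 225),
]

def pairwise_counts(rows):
    """Return {(author1, author2): (intersection, union)} over distinct groups."""
    groups = defaultdict(set)
    for author, group, _id in rows:
        groups[author].add(group)  # duplicates collapse: only distinct groups count
    result = {}
    for a, b in combinations(sorted(groups), 2):
        inter = len(groups[a] & groups[b])
        union = len(groups[a]) + len(groups[b]) - inter
        result[(a, b)] = (inter, union)
    return result

counts = pairwise_counts(rows)
for (a, b), (inter, union) in counts.items():
    # Jaccard similarity is intersection / union.
    print(a, b, inter, union, inter / union)
```

On these rows daniel has three distinct groups and adam and harry have one each, so the pair (adam, daniel) gets intersection 1 and union 3.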

I think the right approach is to left join table a with table b on a.group == b.group, but I can't figure out how to do the total counts.

Thanks

2 answers:

Answer 0: (score: 1)

"Jumping in" because 1) I still don't see any answers and 2) I see that the author's question is tagged with BigQuery.

So, in theory, the queries below should accomplish your task (using the bigquery-samples.reddit.full table for the examples):

BigQuery Legacy SQL:

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.subr = b.subr) AS count_intersection,
  EXACT_COUNT_DISTINCT(a.subr) + EXACT_COUNT_DISTINCT(b.subr) - SUM(a.subr = b.subr) AS count_union
FROM 
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS a
CROSS JOIN 
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS b
WHERE a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100

BigQuery Standard SQL:

WITH subrs AS (
  SELECT author, subr 
  FROM `bigquery-samples.reddit.full` 
  GROUP BY 1, 2
)
SELECT
  a.author AS author1, 
  b.author AS author2, 
  COUNTIF(a.subr = b.subr) AS count_intersection,
  COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a 
JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100

If you try to run them, you will most likely get the error below:

An internal error occurred and the request could not be completed

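The reason is that each of these two queries produces about a trillion rows (see the stats below). There are many ways to address this; the one proposed below is to adjust the requirements. Do you really need to include "lightweight" authors in the algorithm, say those with only one or two subreddits? Or do you really want to find similarity between people who have only a few comments in a particular subreddit?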
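A rough way to see where the row count comes from: before the GROUP BY, every pair of authors (a < b) contributes one row for each combination of their distinct subreddits. The sketch below does that arithmetic. The function name `join_rows` and the input sizes are hypothetical and are not statistics of the real table.

```python
def join_rows(subr_counts):
    """Rows produced by pairing every two authors on all of their subreddits.

    subr_counts holds the number of distinct subreddits per author.
    A pair (a, b) contributes subr_counts[a] * subr_counts[b] rows, and the
    sum over all unordered pairs equals ((sum n)^2 - sum n^2) / 2.
    """
    total = sum(subr_counts)
    squares = sum(n * n for n in subr_counts)
    return (total * total - squares) // 2

# Three authors with 3, 1 and 1 subreddits: 3*1 + 3*1 + 1*1 = 7 rows.
small = join_rows([3, 1, 1])

# Hypothetical: one million authors with a single subreddit each already
# yields roughly half a trillion rows.
large = join_rows([1] * 1_000_000)

print(small, large)
```

Because the count grows with the square of the number of authors, dropping low-activity authors before the join shrinks the intermediate result far more than proportionally.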

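See below how introducing extra limits helps the above queries run (note: lines is the minimum number of entries per author per subreddit, and subrs is the minimum number of subreddits per user):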

[Image: statistics for the query under different lines and subrs limits]

Below is a version that actually produces a result without any kind of failure:

Standard SQL:

WITH authors AS (
  SELECT author FROM (
    SELECT author, COUNT(1) AS subrs FROM (
      SELECT author, subr, COUNT(1) AS lines 
      FROM `bigquery-samples.reddit.full` 
      GROUP BY 1, 2
      HAVING lines > 1
    ) 
    GROUP BY author
    HAVING subrs > 3
  )
),
subrs AS (
  SELECT author, subr 
  FROM `bigquery-samples.reddit.full` 
  WHERE author IN (SELECT author FROM authors)
  GROUP BY 1, 2
)
SELECT
  a.author AS author1, 
  b.author AS author2, 
  COUNTIF(a.subr = b.subr) AS count_intersection,
  COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100

In a similar way, you can adjust the Legacy SQL version to make it work.

This may not be the best approach, but it at least gives some hope that tasks like this can be run easily in BigQuery without additional workarounds.
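For readers who want to check the filtering step on a small sample, here is a Python sketch of what the authors subquery in the Standard SQL version computes. The function name `qualifying_authors` and the toy data are made up for illustration. The default thresholds mirror HAVING lines > 1 and HAVING subrs > 3.

```python
from collections import Counter

def qualifying_authors(comments, min_lines=1, min_subrs=3):
    """Authors with more than min_subrs subreddits in which they have
    more than min_lines comments (both comparisons are strict)."""
    # (author, subr) -> number of comment rows, like GROUP BY 1, 2 with COUNT(1).
    lines = Counter(comments)
    # author -> number of subreddits that pass the per-subreddit limit.
    active = Counter(
        author for (author, _subr), n in lines.items() if n > min_lines
    )
    return {author for author, n in active.items() if n > min_subrs}

# Toy data: one (author, subr) tuple per comment.
comments = (
    [("a", "x")] * 2 + [("a", "y")] * 2 + [("a", "z")]
    + [("b", "x")] * 2 + [("b", "y")] * 2
    + [("c", "x")]
)

# Lower thresholds so the toy data keeps someone.
kept = qualifying_authors(comments, min_lines=1, min_subrs=1)
print(sorted(kept))
```

Author c is dropped because its only subreddit has a single comment. As in the SQL version, the subreddits of the remaining authors would then be read without the per-subreddit limit.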

Answer 1: (score: 0)

use std::fmt::Debug;

// This is an extension trait.
// You can force all its implementors to implement also some external trait,
// so that two trait bounds essentially collapse into one.
trait HelperTrait: Debug {
    fn helper_method(&mut self);
}

// And this is the "blanket" implementation,
// covering all the types necessary.
impl<T> HelperTrait for T where T: Debug {
    fn helper_method(&mut self) {
        println!("{:?}", self);
    }
}

One can use this feature for reference. Thanks.