Cosine similarity with BigQuery SQL?

Time: 2017-12-04 05:34:59

Tags: sql vector google-bigquery

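I have vectors stored in BigQuery (see How can I compute TF/IDF with SQL (BigQuery)), and I want to find the ones that are most similar to each other. How can I compute cosine similarity with BigQuery standard SQL?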

1 answer:

Answer 0 (score: 3)

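This query takes the vector defined for each document, with one dimension per word, and compares each one with a reference document using the cosine similarity formula (the dot product divided by the product of the two norms):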

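#standardSQL
-- Compare every document against one reference document (id = 11353679).
SELECT ANY_VALUE(title2) orig, ANY_VALUE(tf2id) id_orig, a.id id_similar
  -- Dot product of the two tf-idf vectors divided by the product of their norms.
  -- Despite the alias "distance", a higher value means more similar.
  , ROUND(SAFE_DIVIDE(
      SUM(b.tf_idf * IFNULL(c.tf_idf,0)),
      SQRT(SUM(b.tf_idf*b.tf_idf)) * SQRT(SUM(POW(IFNULL(c.tf_idf,0),2)))
    ),4) distance
  , ANY_VALUE(title1) similar
  , ARRAY_AGG((ROUND(b.tf_idf,4), ROUND(c.tf_idf,4))) weights
  , ARRAY_AGG((b.word, c.word)) words
FROM (
  -- Pair each document's vector (tf1) with the reference vector (tf2).
  SELECT id, tfidfs tf1, tf2, tf2id
  , a.title title1
  , b.title title2
  FROM `fh-bigquery.stackoverflow.tf_idf_experiment_3` a
  CROSS JOIN (
    SELECT tfidfs tf2, id tf2id, title
    FROM `fh-bigquery.stackoverflow.tf_idf_experiment_3`
    WHERE id = 11353679
    LIMIT 1
  ) b
) a
-- One row per word of tf1; c is NULL when the reference document lacks that word.
, UNNEST(tf1) b LEFT JOIN UNNEST(tf2) c ON b.word=c.word
GROUP BY id
ORDER BY distance DESC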
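
To check the arithmetic outside BigQuery, the same score can be sketched in Python. This is only a minimal sketch and not the answer's code: it assumes each document is a dict mapping a word to its tf-idf weight, the names `cosine_similarity`, `doc_a` and `doc_b` are hypothetical, and the weights are made up for illustration. Unlike the query, it takes each norm over all words of that document.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {word: weight} dicts."""
    # Dot product: only words present in both vectors contribute.
    dot = sum(w * v[word] for word, w in u.items() if word in v)
    # Each norm is taken over every word of its own vector.
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    denom = norm_u * norm_v
    # Like SAFE_DIVIDE, return None instead of raising on a zero denominator.
    return dot / denom if denom else None

# Illustrative weights only (not taken from the real table).
doc_a = {"bigquery": 3.0, "sql": 4.0}
doc_b = {"sql": 4.0, "vector": 3.0}

print(cosine_similarity(doc_a, doc_a))  # a document compared with itself
print(cosine_similarity(doc_a, doc_b))  # two documents sharing one word
```

With these made-up weights the dot product of `doc_a` and `doc_b` is 16 and both norms are 5, so the second score is 16 / 25.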

The first result is the same document, which confirms that comparing a document with itself gives a distance of 1:

[Image: first result row]

Second result:

[Image: second result row]

And so on:

[Image: further result rows]

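Caveat: this SQL does a LEFT JOIN, so we only get nulls for words that appear in the left-hand document but not in the right-hand one, and not the other way around. Words that appear only in the right-hand (reference) document produce no rows at all, so that document's norm is summed over the shared words only, and the reported score can be higher than the true cosine similarity.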
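
The effect of the LEFT JOIN can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the answer's code: documents are modelled as dicts of word to tf-idf weight, the function names `left_join_score` and `full_score` are hypothetical, and the weights are made up.

```python
import math

def left_join_score(left, right):
    """Mimic the query: one row per word of the left document only."""
    # A missing right-hand weight becomes 0, like IFNULL(c.tf_idf, 0).
    rows = [(w, right.get(word, 0.0)) for word, w in left.items()]
    dot = sum(l * r for l, r in rows)
    norm_left = math.sqrt(sum(l * l for l, _ in rows))
    # Words that exist only in the right document never appear in rows,
    # so this norm covers the shared words only.
    norm_right = math.sqrt(sum(r * r for _, r in rows))
    denom = norm_left * norm_right
    return dot / denom if denom else None

def full_score(left, right):
    """Cosine similarity with each norm taken over all words of its document."""
    dot = sum(w * right[word] for word, w in left.items() if word in right)
    norm_left = math.sqrt(sum(w * w for w in left.values()))
    norm_right = math.sqrt(sum(w * w for w in right.values()))
    denom = norm_left * norm_right
    return dot / denom if denom else None

# Illustrative weights only: "vector" exists only in the right-hand document.
left_doc = {"sql": 4.0, "bigquery": 3.0}
right_doc = {"sql": 4.0, "vector": 3.0}

print(left_join_score(left_doc, right_doc))
print(full_score(left_doc, right_doc))
```

In this example the LEFT JOIN variant drops the weight of "vector" from the right-hand norm (4 instead of 5), so it reports 16 / 20 where the full computation gives 16 / 25.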