Computing pairwise cosine similarity between a large number of vectors in BigQuery

Asked: 2018-12-28 04:58:24

Tags: google-bigquery cosine-similarity standard-sql

I have a table id_vectors, which contains ids and their corresponding coordinates. Each coordinates is a repeated field with 512 elements.

I am looking for pairwise cosine similarities between all of these vectors. For example, if I have three ids 1, 2, and 3, then I am looking for a table with the cosine similarity between each pair (computed using the 512 coordinates), like this:

id1   id2   similarity
 1     2      0.5
 1     3      0.1
 2     3      0.99

Right now my table has 424,970 unique ids, each with its corresponding 512-dimensional coordinates. That means I essentially need to create about (424970 * 424969 / 2) unique pairs of ids and compute their similarity.
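As a quick sanity check on the scale involved (a back-of-the-envelope Python sketch, not part of the original post):

```python
# Number of unordered pairs among n unique ids: n * (n - 1) / 2
n = 424_970
pairs = n * (n - 1) // 2
print(f"{pairs:,}")  # 90,299,537,965 -- roughly 90 billion pairs to score
```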

I first tried the following query, using a reference from here:

#standardSQL
with pairwise as
(SELECT t1.id as id_1, t1.coords as coord1, t2.id as id_2, t2.coords as coord2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id)

SELECT id_1, id_2, ( 
  SELECT 
    SUM(value1 * value2)/ 
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1 
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM pairwise

But after running for 6 hours, I ran into the following error message: Query exceeded resource limits. 2.2127481953201417E7 CPU seconds were used, and this query must use less than 428000.0 CPU seconds.

Then I thought: instead of using the intermediate table pairwise, why not try creating that table first, and do the cosine similarity computation afterwards.

So I tried the following query:

SELECT t1.id as id_1, t1.coords as coord1, t2.id as id_2, t2.coords as coord2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id

But this time the query could not finish, and I got the following message: Error: Quota exceeded: Your project exceeded quota for total shuffle size limit. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors

Then I tried to create an even smaller table, containing just the combination pairs of ids with the coordinates stripped out, using the following query:

SELECT t1.id as id_1, t2.id as id_2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id

Again my query failed, with the error message Query exceeded resource limits. 610104.3843576935 CPU seconds were used, and this query must use less than 3000.0 CPU seconds. (error code: billingTierLimitExceeded)


I totally understand that this is a huge query, and that what stops me is my billing quota.

What I am asking is: is there a way to execute the query in a smarter way, so that I do not exceed any of the resourceLimit, shuffleSizeLimit, or billingTierLimit?

1 Answer:

Answer 0 (score: 1)

A simple idea: instead of joining the table with its redundant coordinates, first create a plain pairs table (id1, id2), and then "dress up" each id with its coordinate vector via two extra joins to project.dataset.id_vectors.

Below is a quick example of this:

#standardSQL
WITH pairwise AS (
  SELECT t1.id AS id_1, t2.id AS id_2
  FROM `project.dataset.id_vectors` t1
  INNER JOIN `project.dataset.id_vectors` t2
  ON t1.id < t2.id
)
SELECT id_1, id_2, ( 
  SELECT 
    SUM(value1 * value2)/ 
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(a.coords) value1 WITH OFFSET pos1 
  JOIN UNNEST(b.coords) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors` a ON a.id = id_1
JOIN `project.dataset.id_vectors` b ON b.id = id_2

Obviously, it works on a small dummy set, like below:

#standardSQL
WITH `project.dataset.id_vectors` AS (
  SELECT 1 id, [1.0, 2.0, 3.0, 4.0] coords UNION ALL
  SELECT 2, [1.0, 2.0, 3.0, 4.0] UNION ALL
  SELECT 3, [2.0, 0.0, 1.0, 1.0] UNION ALL
  SELECT 4, [0, 2.0, 1.0, 1.0] UNION ALL 
  SELECT 5, [2.0, 1.0, 1.0, 0.0] UNION ALL
  SELECT 6, [1.0, 1.0, 1.0, 1.0]
), pairwise AS (
  SELECT t1.id AS id_1, t2.id AS id_2
  FROM `project.dataset.id_vectors` t1
  INNER JOIN `project.dataset.id_vectors` t2
  ON t1.id < t2.id
)
SELECT id_1, id_2, ( 
  SELECT 
    SUM(value1 * value2)/ 
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(a.coords) value1 WITH OFFSET pos1 
  JOIN UNNEST(b.coords) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors` a ON a.id = id_1
JOIN `project.dataset.id_vectors` b ON b.id = id_2

with the result:

Row id_1    id_2    cosine_similarity    
1   1       2       1.0  
2   1       3       0.6708203932499369   
3   1       4       0.819891591749923    
4   1       5       0.521749194749951    
5   1       6       0.9128709291752769   
6   2       3       0.6708203932499369   
7   2       4       0.819891591749923    
8   2       5       0.521749194749951    
9   2       6       0.9128709291752769   
10  3       4       0.3333333333333334   
11  3       5       0.8333333333333335   
12  3       6       0.8164965809277261   
13  4       5       0.5000000000000001   
14  4       6       0.8164965809277261   
15  5       6       0.8164965809277261     
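The SUM/SQRT formula in the SQL above can be cross-checked outside BigQuery. This small Python sketch (my addition, not part of the original answer) mirrors the correlated subquery and reproduces, for example, the id 1 vs id 3 row:

```python
import math

def cosine_similarity(v1, v2):
    # Mirrors the SQL: SUM(value1 * value2) / SQRT(SUM(value1 * value1)) / SQRT(SUM(value2 * value2))
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / math.sqrt(sum(a * a for a in v1)) / math.sqrt(sum(b * b for b in v2))

# id 1 -> [1.0, 2.0, 3.0, 4.0], id 3 -> [2.0, 0.0, 1.0, 1.0]
print(cosine_similarity([1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 1.0, 1.0]))  # ~ 0.6708203932499369
```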

So, give it a try with your real data and let's see how it works for you :o)

And... obviously, you should pre-create / materialize the pairwise table.

Another optimization idea is to precompute the SQRT(SUM(value1 * value1)) values in your project.dataset.id_vectors table - this would save a lot of CPU - it should be a simple adjustment, so I am leaving it to you :o)
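To illustrate that last idea (a hypothetical sketch in Python, my addition; in BigQuery the norm would be stored as an extra precomputed column on id_vectors, and the column name `norm` is my invention): each vector's norm is computed once and reused for every pair, so each of the ~90 billion pairs only costs a dot product:

```python
import math

def norm(v):
    # Computed once per id and stored alongside the vector (hypothetical extra column)
    return math.sqrt(sum(x * x for x in v))

def cosine_with_precomputed_norms(v1, n1, v2, n2):
    # Per pair, only the dot product remains to be evaluated
    return sum(a * b for a, b in zip(v1, v2)) / (n1 * n2)

v1, v2 = [1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 1.0, 1.0]
n1, n2 = norm(v1), norm(v2)
print(cosine_with_precomputed_norms(v1, n1, v2, n2))  # same value as the full SUM/SQRT formula
```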