Retrieve the top k similar rows for each row in a matrix via cosine similarity in R

Time: 2015-09-26 00:57:11

Tags: r similarity cosine-similarity

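The question "How to efficiently retrieve top K-similar vectors by cosine similarity using R?" asks how to compute, for each vector in one matrix, the most similar vectors in another matrix. It was satisfactorily answered, and I would like to adapt that answer to run on a single matrix.

That is, for each row of a matrix I want the top k most similar other rows. I suspect the solution is very similar, but that it can be optimized.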

1 answer:

Answer 0: (score: -1)

This function is based on the linked answer:

CosineSimilarities <- function(m, top.k) {
  # Computes cosine similarity between each row and all other rows in a matrix.
  #
  # Args:
  #   m: Matrix of values.
  #   top.k: Number of top rows to show for each row.
  #
  # Returns:
  #   Data frame with columns for pair of rows, and cosine similarity, for top
  #   `top.k` rows per row.
  #   
  # Similarity computation
  cp <- tcrossprod(m)
  mm <- rowSums(m ^ 2)
  result <- cp / sqrt(outer(mm, mm))
  # Top similar rows (per row): column j of `top` ranks all rows against row j
  # Take `top.k + 1` rows because each row is its own best match (similarity = 1)
  top <- apply(result, 2, order, decreasing=TRUE)[seq_len(top.k + 1), ]
  result.df <- data.frame(row.id1=c(col(top)), row.id2=c(top))
  result.df$cosine.similarity <- result[as.matrix(result.df[, 2:1])]
  # Remove same-row records and return
  return(result.df[result.df$row.id1 != result.df$row.id2, ])
}
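
The similarity step in the function above (`tcrossprod`, `rowSums`, `outer`) can be checked against the textbook definition, dot product divided by the product of the two norms, computed one pair at a time. The sketch below is only a sanity check: `CosinePair` is a hypothetical helper introduced here for illustration and is not part of the answer.

```r
# Hypothetical helper (an assumption for illustration): cosine similarity
# of two numeric vectors, straight from the definition.
CosinePair <- function(x, y) {
  sum(x * y) / (sqrt(sum(x ^ 2)) * sqrt(sum(y ^ 2)))
}

m <- matrix(1:9, nrow=3)
n <- nrow(m)

# Matrix form, mirroring the three similarity lines of the function.
cp <- tcrossprod(m)               # cp[i, j] = sum(m[i, ] * m[j, ])
mm <- rowSums(m ^ 2)              # squared Euclidean norm of each row
sim <- cp / sqrt(outer(mm, mm))   # divide by |row i| * |row j|

# Pairwise form: fill an n x n matrix one (i, j) pair at a time.
pairwise <- sapply(seq_len(n), function(j) {
  sapply(seq_len(n), function(i) CosinePair(m[i, ], m[j, ]))
})

# The two forms should agree up to floating-point tolerance.
print(all.equal(sim, pairwise))
```

The matrix form does the same arithmetic for all pairs at once, which is why it is preferred over the double loop for anything beyond a toy matrix.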

Example usage of `CosineSimilarities`:

(m <- matrix(1:9, nrow=3))
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9
CosineSimilarities(m, 1)
#   row.id1 row.id2 cosine.similarity
# 2       1       2            0.9956
# 4       2       3            0.9977
# 6       3       2            0.9977
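
One edge case is worth noting: the function relies on each row appearing among its own first `top.k + 1` matches. If several rows are identical or collinear, ties at similarity 1 can push the row itself out of that window, and the final filter then leaves `top.k + 1` records for that row. A minimal sketch of a variant that masks the diagonal before ranking is given below. The name `TopKSimilarRows` and the input checks are assumptions made for illustration, not part of the answer, and the sketch assumes `m` has no all-zero rows.

```r
# Hypothetical variant (name and checks are assumptions, not from the answer).
TopKSimilarRows <- function(m, top.k) {
  stopifnot(is.matrix(m), top.k >= 1, top.k < nrow(m))
  mm <- rowSums(m ^ 2)
  sim <- tcrossprod(m) / sqrt(outer(mm, mm))
  # Mask self-similarity so a row can never be chosen as its own neighbour.
  diag(sim) <- -Inf
  # Column j holds the indices of the top.k rows most similar to row j.
  top <- apply(sim, 2, function(s) order(s, decreasing=TRUE)[seq_len(top.k)])
  top <- matrix(top, nrow=top.k)  # keep a matrix shape when top.k == 1
  result.df <- data.frame(row.id1=c(col(top)), row.id2=c(top))
  result.df$cosine.similarity <-
    sim[cbind(result.df$row.id2, result.df$row.id1)]
  result.df
}

m <- matrix(1:9, nrow=3)
res <- TopKSimilarRows(m, 1)
print(res)

# Three identical rows plus one different row: still one record per row.
d <- rbind(c(1, 2), c(1, 2), c(1, 2), c(2, 1))
res.dup <- TopKSimilarRows(d, 1)
print(res.dup)
```

For the 3 x 3 matrix this selects the same pairs as the example above (row 1 with 2, row 2 with 3, row 3 with 2), and because the self-match is never ranked, exactly `top.k` records come back per row without a filtering step.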