Question

I have a DataFrame (d1) with columns (index, features) and a second one (d2) with the same columns.

features is a Seq[Double] and index is a String.

d1 has about one million rows; d2 may have anywhere between 40 and 10,000 rows.

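I want to end up with a DataFrame of (index, CosineSimilarities): for each row of d1, CosineSimilarities is a Seq[Double] holding the cosine similarity between that row and every row of d2. The length of CosineSimilarities should therefore equal the number of rows in d2.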
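To make the target concrete, here is a minimal plain-Scala sketch (no Spark) of what I mean. The body of myCosineSimilarityMethod is not shown anywhere in this post, so the implementation below is only an assumption, as are the object name CosineSketch, the helper similaritiesToAll, and the choice to return 0.0 for a zero vector.

```scala
object CosineSketch {
  // Hypothetical body for myCosineSimilarityMethod: dot(a, b) / (|a| * |b|)
  def myCosineSimilarityMethod(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    // Assumption: similarity involving a zero vector is defined as 0.0
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // One d1 row against every d2 row; the result has one entry per d2 row
  def similaritiesToAll(row: Seq[Double], d2: Seq[Seq[Double]]): Seq[Double] =
    d2.map(d2features => myCosineSimilarityMethod(row, d2features))
}
```

For a d1 row with features Seq(1.0, 2.0) and a d2 of three rows, similaritiesToAll returns a sequence of three values, which is the shape I want in the CosineSimilarities column.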

My first approach was to use a DenseMatrix and an IndexedRowMatrix with d1.multiply(d2.transpose). But it is hard to map the results back to index, and the job breaks when d2 gets large.

My second approach is:

d1
  .cartesian(d2)
  .repartition(n)
  .map { case ((d1index, d1features), (_, d2features)) =>
    (d1index, myCosineSimilarityMethod(d1features, d2features))
  }

But this is painful.

My third approach is to broadcast d2 and go through d1 row by row:

d1
  .mapValues { d1features =>
    // d2broadcasted is assumed to hold d2 as broadcast (index, features) pairs
    d2broadcasted
      .value
      .map { case (_, d2features) =>
        myCosineSimilarityMethod(d1features, d2features)
      }
      .toSeq
  }

It works, and it is more scalable and faster than approach 2, but not as fast as approach 1.

Is there another, better way?

Edit

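I had the idea of computing the centroid of d2 and then computing the distance from each row of d1 to that centroid. Would that be useful? Is there a way to get the centroid of a DataFrame?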
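By centroid I mean the component-wise mean of the features vectors. A rough plain-Scala sketch of that on already collected rows (the object name CentroidSketch and the function centroid are hypothetical, and this is not a Spark API):

```scala
object CentroidSketch {
  // Component-wise mean of a non-empty collection of equal-length vectors
  def centroid(rows: Seq[Seq[Double]]): Seq[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val dim = rows.head.length
    require(rows.forall(_.length == dim), "all rows must have the same length")
    // Sum each component across all rows, then divide by the row count
    val sums = rows.foldLeft(Vector.fill(dim)(0.0)) { (acc, row) =>
      acc.zip(row).map { case (s, x) => s + x }
    }
    sums.map(_ / rows.length)
  }
}
```

Note that comparing each d1 row to this single vector would give one value per d1 row rather than one value per d2 row, so it would not produce the same output as the CosineSimilarities column described above.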

How to efficiently compute cosine similarity between two DataFrames

0 answers: