Question

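I am using Spark MLlib in one of my projects, and I need to compute document similarity.

I first convert the documents to vectors with MLlib's tf-idf transformation, then wrap the result in a RowMatrix and call its columnSimilarities() method.

I followed the tf-idf documentation and used the DIMSUM implementation of cosine similarity.

This is the Scala code I ran in spark-shell: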

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()

val tf = hashingTF.transform(documents)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm

val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix

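Assume the input file in this code block, test1, is a simple file containing 5 short documents (fewer than 10 terms each), one per line.

Since I am only testing this code, I would like to see the output of mat.columnSimilarities(), which is held in the object sim. I want to see the similarity of the first document vector to the second, the third, and so on.

I looked at the Spark documentation for CoordinateMatrix, which is the type of object returned by the columnSimilarities method of the RowMatrix class and is what sim refers to.

After going through more of the documentation, I figured I could convert the CoordinateMatrix to a RowMatrix, convert the rows of the RowMatrix to an array, and then print them like this: println(sim.toRowMatrix().rows.toArray().mkString("\n")).

But that produces output I cannot make sense of.

Can anyone help? Links to any kind of resource would be a great help!

Thanks!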

Answer 1

You can try the following, without converting to the row matrix format:

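import org.apache.spark.mllib.linalg.distributed.MatrixEntry

// Format each entry as "row,col,similarity"; the value is named `value` so it does not shadow the outer `sim`
val transformedRDD = sim.entries.map { case MatrixEntry(row, col, value) => s"$row,$col,$value" }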

To retrieve the elements, call the following action:

transformedRDD.collect()
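
One thing to keep in mind when reading those entries: columnSimilarities() compares the columns of the RowMatrix, and in the question's setup each row is a document, so the indices in each entry refer to term (hash bucket) columns rather than to documents. The following is only a sketch of one possible way to get document-to-document similarities instead, by transposing the matrix first. It assumes spark-shell (so `sc` exists) and a Spark version that provides CoordinateMatrix.transpose(); the toy vectors and the helper `show` are made up for illustration and stand in for the tfidf RDD.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, MatrixEntry, RowMatrix}

// Toy stand-in for the TF-IDF vectors: 3 documents (rows) x 4 terms (columns).
// `sc` is the SparkContext that spark-shell provides.
val docs = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0, 0.0), // document 0
  Vectors.dense(2.0, 0.0, 4.0, 0.0), // document 1, parallel to document 0
  Vectors.dense(0.0, 3.0, 4.0, 0.0)  // document 2
))

// Called directly, columnSimilarities() compares the 4 term columns.
val termSims = new RowMatrix(docs).columnSimilarities().entries.collect()

// To compare documents: attach a row index to each vector, transpose so that
// documents become columns, and only then compute column similarities.
val indexed = new IndexedRowMatrix(docs.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
val transposed = indexed.toCoordinateMatrix().transpose().toRowMatrix()
val docSims = transposed.columnSimilarities().entries.collect()

// Hypothetical helper: print entries sorted by index pair.
def show(label: String, entries: Array[MatrixEntry]): Unit = {
  entries.sortBy(e => (e.i, e.j)).foreach { case MatrixEntry(i, j, value) =>
    println(f"$label $i%d vs $label $j%d: $value%.4f")
  }
}

show("term", termSims)
show("document", docSims)
```

The result is upper-triangular, so each pair appears once with the smaller index first, and pairs whose similarity is zero may not appear at all. With the real data, `docs` would be replaced by `tfidf`.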

Printing a CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark