I need to compute the similarity between the columns of a RowMatrix, and I tried using the columnSimilarities() method to get the result.
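import java.util.LinkedList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import org.apache.spark.sql.SparkSession;

// The class name is a placeholder; the post showed only main().
public class ColumnSimilarityExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("CollarberativeFilter").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        SparkSession spark = SparkSession.builder().appName("CollarberativeFilter").getOrCreate();
        double[][] array = {{5,0,5}, {0,10,0}, {5,0,5}};
        // Turn each row of the array into a dense MLlib vector.
        LinkedList<Vector> rowsList = new LinkedList<Vector>();
        for (int i = 0; i < array.length; i++) {
            Vector currentRow = Vectors.dense(array[i]);
            rowsList.add(currentRow);
        }
        JavaRDD<Vector> rows = sc.parallelize(rowsList);
        // Create a RowMatrix from JavaRDD<Vector>.
        RowMatrix mat = new RowMatrix(rows.rdd());
        // Cosine similarities between the columns of mat.
        CoordinateMatrix simsPerfect = mat.columnSimilarities();
        RowMatrix mat2 = simsPerfect.toRowMatrix();
        List<Vector> vs2 = mat2.rows().toJavaRDD().collect();
        List<Vector> vs = mat.rows().toJavaRDD().collect();
        System.out.println("mat");
        for (Vector v : vs) {
            System.out.println(v);
        }
        System.out.println("mat2");
        for (Vector v : vs2) {
            System.out.println(v);
        }
        // Write each similarity entry as "i,j,value".
        JavaRDD<MatrixEntry> entries = simsPerfect.entries().toJavaRDD();
        JavaRDD<String> output = entries.map(new Function<MatrixEntry, String>() {
            public String call(MatrixEntry e) {
                return String.format("%d,%d,%s", e.i(), e.j(), e.value());
            }
        });
        output.saveAsTextFile("resources123/data.txt");
    }
}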
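However, the output in the text file is 0,2,0.9999999999999998
Next, I tried the same example with double[][] array = {{1,3}, {2,7}};
In that case the output in the text file is 0,1,0.9982743731749959
Can someone explain the format of this output? I cannot get a score for every pair of columns of the matrix. For a 3-by-3 matrix, for example, I need 3 scores: the similarity between columns 1 and 2, columns 2 and 3, and columns 3 and 1. Any help is appreciated.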
Answer 0 (score: 3)
The column similarities are computed as the Cosine Similarity between each pair of columns.
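For reference, here is the standard cosine-similarity definition written out for two columns of the input matrix. The symbols c_i and c_j (column i and column j as vectors) are notation introduced here for illustration, not taken from the Spark documentation. The second line works through the 2-by-2 matrix {{1,3}, {2,7}} from the question, whose columns are (1,2) and (3,7):

```latex
\[
\operatorname{sim}(c_i, c_j) = \frac{c_i \cdot c_j}{\lVert c_i \rVert_2 \, \lVert c_j \rVert_2}
\]
\[
\operatorname{sim}(c_0, c_1)
  = \frac{1 \cdot 3 + 2 \cdot 7}{\sqrt{1^2 + 2^2}\,\sqrt{3^2 + 7^2}}
  = \frac{17}{\sqrt{5}\,\sqrt{58}}
  = \frac{17}{\sqrt{290}}
  \approx 0.99827
\]
```

This agrees with the 0,1,0.9982743731749959 line reported in the question.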
Since you included the scala tag, I will cheat and repeat what you did in the Scala REPL:
scala> import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.linalg.{Vectors, Vector}
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val matVec = Vector(Vectors.dense(5,0,5), Vectors.dense(0,10,0), Vectors.dense(5,0,5))
matVec: scala.collection.immutable.Vector[org.apache.spark.mllib.linalg.Vector] = Vector([5.0,0.0,5.0], [0.0,10.0,0.0], [5.0,0.0,5.0])
scala> val matRDD = sc.parallelize(matVec)
matRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[44] at parallelize at <console>:37
scala> val myRowMat = new RowMatrix(matRDD)
myRowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@7a7a07c2
scala> myRowMat.columnSimilarities.entries.collect.foreach{println}
MatrixEntry(0,2,0.9999999999999998)
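This output means the similarity matrix has a single non-zero entry, at (row0, col2). Both indices refer to columns of the original matrix, so this entry is the similarity between col0 and col2. The actual (upper triangular) output is therefore: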
0 0 .9999
0 0 0
0 0 0
This is what you would expect, since the dot product between col0 and col1 is zero and the dot product between col1 and col2 is zero.
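The zero entries can be checked by hand without Spark. The following is a minimal plain-Java sketch that computes the cosine similarity for every column pair of the 3-by-3 matrix from the question. The class name ColumnCosine and the method cosine are hypothetical helpers introduced here, and returning 0 for an all-zero column is an assumption of this sketch, not a statement about Spark's behaviour.

```java
// Plain-Java sketch (no Spark): cosine similarity between columns of a dense matrix.
// ColumnCosine and cosine(...) are illustrative names, not part of any library.
class ColumnCosine {

    // Cosine similarity between column i and column j of m.
    static double cosine(double[][] m, int i, int j) {
        double dot = 0.0, normI = 0.0, normJ = 0.0;
        for (double[] row : m) {
            dot += row[i] * row[j];
            normI += row[i] * row[i];
            normJ += row[j] * row[j];
        }
        // Assumption: treat an all-zero column as having similarity 0 with everything.
        if (normI == 0.0 || normJ == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normI) * Math.sqrt(normJ));
    }

    public static void main(String[] args) {
        double[][] m = {{5, 0, 5}, {0, 10, 0}, {5, 0, 5}};
        int n = m[0].length;
        // Visit each unordered pair (i, j) with i < j, i.e. the upper triangle.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                System.out.println(i + "," + j + "," + cosine(m, i, j));
            }
        }
    }
}
```

For this matrix the pairs (0,1) and (1,2) come out as zero and the pair (0,2) comes out as 1 up to floating-point rounding, which is why only the (0,2) entry appears in the Spark output.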
Here is an example of a column similarity matrix that is not sparse:
scala> import scala.util.Random
import scala.util.Random
scala> def randVec(len: Int) : org.apache.spark.mllib.linalg.Vector =
| Vectors.dense(Array.fill(len)(Random.nextDouble))
randVec: (len: Int)org.apache.spark.mllib.linalg.Vector
scala> val randRDD = sc.parallelize(Seq.fill(3)(randVec(4)))
randRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[123] at parallelize at <console>:38
scala> val randRowMat = new RowMatrix(randRDD)
randRowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@77d9112e
scala> randRowMat.rows.collect.foreach{println}
[0.11049508671100228,0.6560383649078886,0.08647831963379027,0.918734774579884]
[0.5709766390994561,0.5404121150599919,0.8206115742925799,0.12848224469499103]
[0.5414651842028494,0.26273347471310016,0.3139446375461201,0.351113866208812]
scala> randRowMat.columnSimilarities.entries.collect.foreach{println}
MatrixEntry(0,3,0.4630854334046888)
MatrixEntry(0,2,0.9238294198864545)
MatrixEntry(2,3,0.33700154742702093)
MatrixEntry(0,1,0.7402725425024911)
MatrixEntry(1,2,0.7418690274112878)
MatrixEntry(1,3,0.8662504236158493)
which represents the following matrix:
0 0.74027 0.92382 0.46308
0 0 0.74186 0.86625
0 0 0 0.33700
0 0 0 0
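To get a score for every column pair, as the question asks, one option is to expand the collected entries into a full table and read any missing pair as 0. The sketch below is plain Java and uses a hypothetical Entry class as a stand-in for the (i, j, value) triples of MatrixEntry. SimilarityTable and toFullMatrix are illustrative names, not Spark API. It mirrors each entry so that the pair (3,1) can be looked up as easily as (1,3), and it leaves the diagonal at 0 because the outputs shown above contain no diagonal entries.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch: expand sparse upper-triangular (i, j, value) entries
// into a full symmetric n-by-n table. All names here are illustrative.
class SimilarityTable {

    // Hypothetical stand-in for the (i, j, value) triple of a MatrixEntry.
    static final class Entry {
        final int i;
        final int j;
        final double value;

        Entry(int i, int j, double value) {
            this.i = i;
            this.j = j;
            this.value = value;
        }
    }

    // Pairs without an entry keep the default value 0.0.
    static double[][] toFullMatrix(int n, List<Entry> entries) {
        double[][] full = new double[n][n];
        for (Entry e : entries) {
            full[e.i][e.j] = e.value;
            full[e.j][e.i] = e.value; // mirror into the lower triangle
        }
        return full;
    }

    public static void main(String[] args) {
        // The single entry reported for the 3-by-3 example in the question.
        List<Entry> entries = Arrays.asList(new Entry(0, 2, 0.9999999999999998));
        for (double[] row : toFullMatrix(3, entries)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With Spark, the list of entries would come from collecting the entries of the CoordinateMatrix, and n would be the number of columns of the original matrix.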