Source data:
scala> dataframe.show
+--------------------+--------------------+
|                moid|            features|
+--------------------+--------------------+
|0031222c889642608...|(5,[0,1,2,3,4],[0...|
|0013103228494a7b9...|(5,[0,2,3,4],[0.1...|
|003e1996e51a435e8...|(5,[0,2,3,4],[0.2...|
|0044b270064342ac8...|(5,[0,1,2,3,4],[0...|
|00b36594a2a644f09...|(5,[0,1,2,3,4],[0...|
|00e8387be566492c9...|(5,[0,1,2,3,4],[0...|
|01158f88e19148b39...|(5,[0,1,3,4],[0.1...|
|011952d6c52b43019...|(5,[0,1,2,3,4],[0...|
|0156b479932b449bb...|(5,[0,1,2,3,4],[0...|
|015fb90315cc43b19...|(5,[0,1,2,3,4],[0...|
|0186aa87f3f04d1d8...|(5,[0,1,2,4],[0.2...|
|019bc8d4096e41ad8...|(5,[0,1,3,4],[0.4...|
|0224ed4d3d5d4a3ca...|(5,[0,1,2,3,4],[0...|
|0279fd0bb2f2458ba...|(5,[0,1,2,3,4],[0...|
|02847207432d4de9a...|(5,[0,1,2,4],[0.2...|
|028715c44bac423f8...|(5,[1,2,4],[0.243...|
|02ccf2c118a046e69...|(5,[1,2,4],[0.243...|
|005a55b9a230452b9...|(5,[0,2,3,4],[0.2...|
|02e02d27ce13448db...|(5,[0,1,2,3,4],[0...|
|013150a3c5fc42d88...|(5,[0,1,2,4],[0.1...|
+--------------------+--------------------+
scala> dataframe.printSchema
root
|-- moid: string (nullable = false)
|-- features: vector (nullable = true)
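The features column holds vectors of type org.apache.spark.ml.linalg.SparseVector.
I want to compute the cosine similarity between every pair of rows, then keep the ten most similar rows for each row, and finally end up with a 'top_sim_map':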
val top_sim_map = Map[String,Array[(String,Double)]]()
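To make the goal concrete, here is a minimal plain-Scala sketch of the intended computation on a tiny in-memory data set, without Spark. The Sparse case class, the cosine and topK helpers, and the sample ids are hypothetical names made up for this illustration; they only mimic the (size, indices, values) layout of a SparseVector. The sketch also assumes a row should not be listed as its own neighbour.

```scala
// Hypothetical stand-in for a sparse vector: sorted indices plus their values.
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) {
  def norm: Double = math.sqrt(values.map(v => v * v).sum)
}

// Cosine similarity that walks only the stored (non-zero) entries.
def cosine(a: Sparse, b: Sparse): Double = {
  var i = 0
  var j = 0
  var dot = 0.0
  while (i < a.indices.length && j < b.indices.length) {
    if (a.indices(i) == b.indices(j)) {
      dot += a.values(i) * b.values(j)
      i += 1
      j += 1
    } else if (a.indices(i) < b.indices(j)) i += 1
    else j += 1
  }
  val denom = a.norm * b.norm
  if (denom == 0.0) 0.0 else dot / denom // guard against all-zero vectors
}

// For every id, keep the k most similar other ids, highest similarity first.
def topK(data: Seq[(String, Sparse)], k: Int): Map[String, Array[(String, Double)]] =
  data.map { case (id, vec) =>
    val sims = data.collect { case (other, v) if other != id => (other, cosine(vec, v)) }
    id -> sims.sortBy(-_._2).take(k).toArray
  }.toMap

// Made-up sample rows, not taken from the data above.
val sample = Seq(
  "a" -> Sparse(5, Array(0, 1), Array(1.0, 1.0)),
  "b" -> Sparse(5, Array(0, 1), Array(2.0, 2.0)),
  "c" -> Sparse(5, Array(3), Array(1.0))
)
val sketchMap = topK(sample, 10)
```

In this sample, "a" and "b" point in the same direction, so each is the other's closest neighbour, while "c" shares no index with either of them.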
Here is what I did:
def cosineSimilarity(vectorA: org.apache.spark.ml.linalg.SparseVector, vectorB: org.apache.spark.ml.linalg.SparseVector):Double = {
  var dotProduct = 0.0
  var normA = 0.0
  var normB = 0.0
  var index = vectorA.size - 1
  for (i <- 0 to index) {
    dotProduct += vectorA(i) * vectorB(i)
    normA += Math.pow(vectorA(i), 2)
    normB += Math.pow(vectorB(i), 2)
  }
  (dotProduct / (Math.sqrt(normA) * Math.sqrt(normB)))
}
val rddData = dataframe.rdd
val rddDataLocal = rddData.collect()
val br_rddDataLocal = spark.sparkContext.broadcast(rddDataLocal)
val top_sim_map = Map[String,Array[(String,Double)]]()
rddData.foreach((r:Row)=>{
  val moid = r.getString(0)
  val vec_a = r.getAs[org.apache.spark.ml.linalg.SparseVector](1)
  var simArr:Array[(String,Double)] = Array(("0",0.0),("0",0.0),("0",0.0),("0",0.0),("0",0.0), ("0",0.0),("0",0.0),("0",0.0),("0",0.0),("0",0.0))
  br_rddDataLocal.value.foreach((row_tg:Row)=>{
    val num_b:String = row_tg.getString(0)
    val vec_b = row_tg.getAs[org.apache.spark.ml.linalg.SparseVector](1)
    val sim:Double = cosineSimilarity(vec_a,vec_b)
    simArr = simArr.map((t)=>{
      if(simArr.min._2>sim) (num_b,sim) else t
    })
  })
  top_sim_map += {moid->simArr}
})
My question is: why is top_sim_map empty?
scala> top_sim_map.size
res36: Int = 0
scala> top_sim_map.isEmpty
res37: Boolean = true
scala> top_sim_map.take(100).foreach(println)
scala>
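One likely explanation, hedged because it depends on how the job is run: Spark serializes the function passed to rdd.foreach together with the variables it captures and runs it on the executors, so top_sim_map += ... updates a deserialized copy rather than the map held by the driver. The plain-Scala sketch below only imitates that round trip with Java serialization; roundTrip, driverMap, and executorCopy are hypothetical names for this illustration, not Spark APIs.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

// Serialize an object and read it back. This is an assumption-level imitation
// of what happens to captured variables when a task is shipped to an executor.
def roundTrip[T <: AnyRef](value: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  try in.readObject().asInstanceOf[T] finally in.close()
}

// The map as it exists on the "driver" side.
val driverMap = mutable.Map[String, Array[(String, Double)]]()

// The copy that the "executor" side would see after deserialization.
val executorCopy = roundTrip(driverMap)
executorCopy += ("some-moid" -> Array(("other-moid", 0.9)))

println(driverMap.size)    // size of the original map
println(executorCopy.size) // size of the mutated copy
```

The copy receives the new entry while the original stays untouched, which would match the empty map seen above. In Spark itself, the usual way around this is to return the per-row result from map and bring it back to the driver with collect() (or collectAsMap() on a pair RDD) instead of mutating a driver-side collection inside foreach.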