I am trying to compute the Canberra distance between two different RDDs in Apache Spark. The RDDs have the same dimensions and are not especially large.
Does anyone have suggestions on the best way to do this with the RDD API? The equation for the Canberra distance can be seen at the link below.
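For reference, in case the linked image does not render: for two vectors x and y of equal length n, the Canberra distance is

```
d(x, y) = Σ_{i=1}^{n} |x_i - y_i| / (|x_i| + |y_i|)
```

Each term is at most 1, and a term is conventionally taken as 0 when x_i = y_i = 0 (otherwise it would be 0/0).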
Answer 0 (score: 1)
You will want to create an index for each RDD and then join them on it. That lets you map over each joined pair to compute the per-element distance and then reduce to the total sum.
Something like the following should work:
// Assuming the vectors are initially stored as arrays of Double
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val dataX = sc.parallelize(Array(11.0, 12.0, 13.0, 14.0, 15.0))
val dataY = sc.parallelize(Array(21.0, 22.0, 23.0, 24.0, 25.0))

def canberraDist(sc: SparkContext, X: RDD[Double], Y: RDD[Double]): Double = {
  // Index each RDD by element position. zipWithIndex returns (value, index),
  // so swap each pair to put the index first; join keys on the first element.
  val RDDX = X.zipWithIndex().map(_.swap)
  val RDDY = Y.zipWithIndex().map(_.swap)
  // Join the two RDDs on the index. Returns RDD[(Long, (Double, Double))]
  val RDDJoined = RDDX.join(RDDY)
  // Canberra distance: sum of |x - y| / (|x| + |y|) over all element pairs
  RDDJoined.map { case (_, (x, y)) => math.abs(x - y) / (math.abs(x) + math.abs(y)) }
           .reduce(_ + _)
}
val totalDist = canberraDist(sc, dataX, dataY)
println(totalDist)
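Because the two RDDs have the same length, `X.zip(Y)` can replace the index-and-join step entirely, but only when the RDDs also have identical partitioning (`RDD.zip` requires the same number of partitions with the same number of elements in each), which is why the join-based version above is the safer default. The per-element arithmetic itself can be sanity-checked with plain Scala collections, no Spark needed:

```scala
// Plain Scala collections standing in for the sample RDDs above,
// used only to verify the Canberra distance arithmetic without a Spark context.
val x = Seq(11.0, 12.0, 13.0, 14.0, 15.0)
val y = Seq(21.0, 22.0, 23.0, 24.0, 25.0)

// Same map/sum as the RDD version: sum of |a - b| / (|a| + |b|)
val dist = x.zip(y)
  .map { case (a, b) => math.abs(a - b) / (math.abs(a) + math.abs(b)) }
  .sum
// For this data: 10/32 + 10/34 + 10/36 + 10/38 + 10/40 ≈ 1.3976
```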