Computing the Canberra Distance with Apache Spark

Date: 2016-06-30 16:09:11

Tags: apache-spark

I am trying to calculate the Canberra distance between two different RDDs in Apache Spark. The RDDs have the same dimensions and are not particularly large.

Does anyone have a suggestion for the best way to do this with the RDD API? The equation for the Canberra distance can be seen at the link below.

Canberra Distance Between Two Vectors
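For reference, the distance between vectors x and y is the sum over i of |x_i - y_i| / (|x_i| + |y_i|). A minimal plain-Scala sketch of the formula (no Spark involved; the helper name is illustrative, not from the answer below):

```scala
// Canberra distance between two equal-length sequences.
// When both components are zero the term is 0/0; treat it as 0 by convention.
def canberraLocal(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length, "vectors must have the same length")
  x.zip(y).map { case (a, b) =>
    val denom = math.abs(a) + math.abs(b)
    if (denom == 0.0) 0.0 else math.abs(a - b) / denom
  }.sum
}

val d = canberraLocal(Seq(11.0, 12.0, 13.0, 14.0, 15.0),
                      Seq(21.0, 22.0, 23.0, 24.0, 25.0))
// d ≈ 1.3976
```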

1 Answer:

Answer 0 (score: 1)

You need to create an index for each RDD and then join the two on that index. This lets you map over the aligned pairs to compute each per-element distance term, and then reduce to collect the sum.
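The index-then-join step can be sketched without Spark: key each element by its position, then join the two keyed collections. A local sketch using plain Scala Maps (names here are illustrative, not part of the answer's code):

```scala
val xs = Seq(11.0, 12.0, 13.0)
val ys = Seq(21.0, 22.0, 23.0)

// zipWithIndex yields (value, index); swap to (index, value) to key by position
val keyedX = xs.zipWithIndex.map { case (v, i) => (i, v) }.toMap
val keyedY = ys.zipWithIndex.map { case (v, i) => (i, v) }.toMap

// Inner join on the shared index keys, mirroring RDD.join
val joined = keyedX.keySet.intersect(keyedY.keySet).toList.sorted
  .map(i => (i, (keyedX(i), keyedY(i))))
// joined: List((0,(11.0,21.0)), (1,(12.0,22.0)), (2,(13.0,23.0)))
```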

The following should work:

// I am assuming here that your vectors are initially stored as RDDs of Double
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val dataX = sc.parallelize(Array(11.0, 12.0, 13.0, 14.0, 15.0))
val dataY = sc.parallelize(Array(21.0, 22.0, 23.0, 24.0, 25.0))

def canberraDist(sc: SparkContext, X: RDD[Double], Y: RDD[Double]): Double = {
  // Index each RDD by element position. zipWithIndex returns (value, index),
  // so swap each pair to (index, value) so that the join keys on the index.
  val RDDX = X.zipWithIndex().map { case (v, i) => (i, v) }
  val RDDY = Y.zipWithIndex().map { case (v, i) => (i, v) }

  // Join the two RDDs on the index. Returns: RDD[(Long, (Double, Double))]
  val RDDJoined = RDDX.join(RDDY)

  // Sum the per-element Canberra terms. A term where both components
  // are zero would be 0/0, so treat it as 0 by convention.
  RDDJoined.map { case (_, (x, y)) =>
    val denom = math.abs(x) + math.abs(y)
    if (denom == 0.0) 0.0 else math.abs(x - y) / denom
  }.reduce(_ + _)
}

val totalDist = canberraDist(sc, dataX, dataY)
println(totalDist)