Question

I am using Scala 2.12.7 and Spark 2.4.3, and I have an algorithm that operates on an RDD[(List[Double], String)]:

def classifyPoint(point: List[Double], data: RDD[(List[Double], String)], k: Int): String =
  {
    val sortedDistances = data.map{case (a, b) => (b, Util.euclideanDistance(point, a))}.sortBy(_._2, ascending = true)

    val topk = sortedDistances.zipWithIndex().filter(_._2 < k)

    val result = topk.map(_._1).map(entry => (entry._1, 1)).reduceByKey(_+_).sortBy(_._2, ascending = false).first()._1

    //for debugging purposes, remember to remove
    println(s"Point classified as ${result}")

    result
  }

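Using an RDD[List[Double]] called testVector, I want to run the classification in parallel with this line (the classifyPoint function shown above is a method of a class):

val classificationKnn = testVector.map(knnSpark.classifyPoint(_, modelKnn, k))

where modelKnn is an RDD[(List[Double], String)].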

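However, this approach fails with the error "This RDD lacks a SparkContext". As far as I understand, the error is related to nested RDD operations: classifyPoint calls data.map on modelKnn from inside testVector.map.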

Is there a way to avoid this problem and still run the computation in parallel?

If I first convert testVector into a local list, the error goes away, but that also means giving up the ability to classify the points in parallel.
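For reference, one direction that avoids the nested RDD call while keeping testVector distributed is to collect modelKnn once and broadcast it. The sketch below is only an illustration under stated assumptions: modelKnn must be small enough to fit in memory on the driver and on each executor, and the names BroadcastKnnSketch, classifyLocal and the euclideanDistance stand-in are hypothetical, not part of the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastKnnSketch {
  // Hypothetical stand-in for Util.euclideanDistance from the question.
  def euclideanDistance(a: List[Double], b: List[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Same idea as classifyPoint, but over a local collection,
  // so it can run inside a Spark task without touching another RDD.
  def classifyLocal(point: List[Double], data: Seq[(List[Double], String)], k: Int): String =
    data
      .map { case (features, label) => (label, euclideanDistance(point, features)) }
      .sortBy(_._2)                                   // nearest first
      .take(k)                                        // keep the k nearest neighbours
      .groupBy(_._1)                                  // group by label
      .map { case (label, hits) => (label, hits.size) }
      .maxBy(_._2)                                    // majority vote
      ._1

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("knn-broadcast-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Toy data, assumed only for this illustration.
    val modelKnn: RDD[(List[Double], String)] = sc.parallelize(Seq(
      (List(0.0, 0.0), "a"), (List(0.1, 0.2), "a"),
      (List(5.0, 5.0), "b"), (List(5.1, 4.9), "b")))
    val testVector: RDD[List[Double]] = sc.parallelize(Seq(List(0.2, 0.1), List(4.8, 5.2)))
    val k = 3

    // Collect the model on the driver once, then ship a read-only copy to every executor.
    val modelBc: Broadcast[Array[(List[Double], String)]] = sc.broadcast(modelKnn.collect())

    // Only testVector is an RDD here; the closure reads the broadcast value, not another RDD.
    val classificationKnn: RDD[String] = testVector.map(p => classifyLocal(p, modelBc.value, k))

    classificationKnn.collect().foreach(println)
    sc.stop()
  }
}
```

If modelKnn is too large to broadcast, this sketch does not apply as written; a join-style formulation over both RDDs (for example via cartesian) would be a different trade-off.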

How do I solve the SPARK-5063 problem "This RDD lacks a SparkContext"?

0 answers: