Question

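I have two RDDs: points and pointsWithinEps. Each element of points is an (x, y) coordinate. Each element of pointsWithinEps holds two points and the distance between them: ((x, y), distance), where x and y are themselves points. I want to loop over all points and, for each point, keep only the elements of pointsWithinEps that have that point as their x (first) component. So I wrote the following: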

    points.foreach(p =>
      val distances = pointsWithinEps.filter{
        case((x, y), distance) => x == p
      }
      if (distances.count() > 3) {
        // do some other actions
      }
    )

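But this syntax is invalid. As far as I understand, creating variables inside a Spark foreach is not allowed. Should I do something like this instead?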

    for (i <- 0 to points.count().toInt) {
      val p = points.take(i + 1).drop(i) // take the point
      val distances = pointsWithinEps.filter {
        case ((x, y), distance) => x == p
      }
      if (distances.count() > 3) {
        // do some other actions
      }
    }

Or is there a better way to do this? The full code is hosted here: https://github.com/timasjov/spark-learning/blob/master/src/DBSCAN.scala

Edit

    points.foreach({ p =>
      val pointNeighbours = pointsWithinEps.filter {
        case ((x, y), distance) => x == p
      }
      println(pointNeighbours)
    })

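I now have the code above, but it throws a NullPointerException (on pointsWithinEps). How can I fix this, and why is pointsWithinEps null inside the foreach when it contains elements before the foreach?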

Answer 1

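To collect all the distance entries that start at a given coordinate, a simple distributed approach is to key the entries by that coordinate x and group them by that key, like this: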

    val pointsWithinEpsByX = pointsWithinEps.map { case ((x, y), distance) => (x, ((x, y), distance)) }
    val xCoordinatesWithDistance = pointsWithinEpsByX.groupByKey()

Then left-join the RDD of points with the result of the previous transformation:

    val pointsWithCoordinatesWithDistance = points.leftOuterJoin(xCoordinatesWithDistance)
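
The two steps above can be put together in a small sketch. It rests on several assumptions that are not in the original post: a point is a `(Double, Double)` pair, the points are keyed by themselves (`points.map(p => (p, ()))`) so that the join matches on the whole point, Spark 1.3 or later is on the classpath (so no extra implicit import is needed for pair-RDD operations), and the names `NeighbourJoinSketch` and `distancesByPoint` are made up for illustration. Because the work is expressed as a join, `pointsWithinEps` is never referenced from inside a closure over `points`, which Spark does not support.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object NeighbourJoinSketch {
  // Assumed representation: a point is an (x, y) pair of doubles.
  type Point = (Double, Double)

  // For each point, collect the distances of all pairs whose first element
  // is that point. Points with no matching pair get an empty collection.
  def distancesByPoint(
      points: RDD[Point],
      pointsWithinEps: RDD[((Point, Point), Double)]
  ): RDD[(Point, Iterable[Double])] = {
    // Key every pair by its first point and group the distances under that key.
    val byFirst: RDD[(Point, Iterable[Double])] =
      pointsWithinEps
        .map { case ((first, _), distance) => (first, distance) }
        .groupByKey()

    // Key the points by themselves so the join compares whole points.
    points
      .map(p => (p, ()))
      .leftOuterJoin(byFirst)
      .mapValues { case (_, distances) => distances.getOrElse(Iterable.empty[Double]) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("neighbour-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Toy data, invented for this example.
      val points = sc.parallelize(Seq((0.0, 0.0), (1.0, 0.0), (5.0, 5.0)))
      val pointsWithinEps = sc.parallelize(Seq(
        (((0.0, 0.0), (1.0, 0.0)), 1.0),
        (((1.0, 0.0), (0.0, 0.0)), 1.0)
      ))

      // Counterpart of the question's `distances.count() > 3` check,
      // with a lower threshold to suit the toy data.
      val dense = distancesByPoint(points, pointsWithinEps)
        .filter { case (_, distances) => distances.size >= 1 }

      dense.collect().foreach(println)
    } finally sc.stop()
  }
}
```

The `filter` at the end is where the "do some other actions" step from the question would go. It runs on the grouped values of a single RDD rather than launching a nested Spark job per point.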

Answer 2

Declaring a variable means you have a block, not just an expression, so you need to use braces {}, e.g.

    points.foreach({ p => ... })
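
As a minimal illustration of that syntax rule, the sketch below uses a plain Scala `List` in place of the RDD, since the rule belongs to Scala itself and not to Spark. The object name `ForeachBlockSketch` and the sample data are made up for the example. Note that this only addresses the syntax error. Referencing another RDD (such as `pointsWithinEps`) from inside the function passed to `foreach` is not supported by Spark, which is consistent with the NullPointerException described in the question's edit.

```scala
object ForeachBlockSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local data standing in for the `points` RDD.
    val points = List((3.0, 4.0), (1.0, 0.0))

    // The braces make the lambda body a block, so `val` definitions are allowed.
    points.foreach { p =>
      val (x, y) = p
      val norm = math.hypot(x, y) // distance of the point from the origin
      println(s"$p has norm $norm")
    }
  }
}
```

Writing `points.foreach { p => ... }` without the outer parentheses is equivalent to the `({ p => ... })` form and is the more common style.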

Code execution in Spark foreach