Question

How can I break out of a map job in Spark:

val rddFiltered = rdd.map(value => value.toInt).map(value => {

  if (value == 0)
  // if the condition is true , 
  // stop the map execution and return the processed RDD
  value
})

Suppose my RDD is: 3, 4, 7, 1, 3, 0, 4, 6

I want: 3, 4, 7, 1, 3

Answer 1

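This cannot be solved efficiently with Spark. The order in which partitions are processed is not guaranteed, and there is no such thing as a partial transformation.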

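If you expect the number of records to be small enough for the driver to handle, you can implement an approach similar to take: call runJob iteratively and collect partitions until you find the first one that contains a value matching the stop condition (value == 0 in the question).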
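A minimal sketch of that driver-side approach, assuming a spark-shell session where `sc` is available. The name `takeWhileOnDriver` is hypothetical, and the sketch scans one partition per job, whereas Spark's own `take` grows the number of partitions it scans on each round:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

// Hypothetical helper: visits partitions in index order, one job per partition,
// and stops submitting jobs as soon as an element fails the predicate.
def takeWhileOnDriver[T](rdd: RDD[T])(pred: T => Boolean): Vector[T] = {
  val result = Vector.newBuilder[T]
  var p = 0
  var found = false
  while (!found && p < rdd.getNumPartitions) {
    // Run a job on partition p only. The task returns the leading elements
    // that satisfy pred, plus a flag telling whether it hit a failing element.
    val (prefix, stopped) = rdd.sparkContext.runJob(
      rdd,
      (iter: Iterator[T]) => {
        val buf = ArrayBuffer.empty[T]
        var stop = false
        while (!stop && iter.hasNext) {
          val x = iter.next()
          if (pred(x)) buf += x else stop = true
        }
        (buf.toVector, stop)
      },
      Seq(p)
    ).head
    result ++= prefix
    found = stopped
    p += 1
  }
  result.result()
}

takeWhileOnDriver(sc.parallelize(Seq(3, 4, 7, 1, 3, 0, 4, 6)))(_ != 0)
// Vector(3, 4, 7, 1, 3)
```

Everything that satisfies the predicate ends up in driver memory, so this only fits when the expected prefix is small.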

Alternatively, at the cost of a full data scan:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def takeWhile[T : ClassTag](rdd: RDD[T])(pred: T => Boolean): RDD[T] = {
  /* Determine the partition where the predicate is not satisfied
     for the first time. fold with Int.MaxValue (rather than min)
     avoids an exception when the predicate holds for every element;
     in that case all partitions are kept. */
  val i = rdd.mapPartitionsWithIndex((i, iter) =>
    if (iter.exists(!pred(_))) Iterator(i) else Iterator()
  ).fold(Int.MaxValue)(_ min _)

  /* Process partitions, dropping elements after the first one
     which doesn't satisfy the predicate */
  rdd.mapPartitionsWithIndex {
    case (j, iter) if j < i => iter  // before the i-th: take all elements

    // i-th: keep only the elements before the first failing one
    case (j, iter) if j == i => iter.takeWhile(pred)

    case _ => Iterator() // after the i-th: drop everything
  }
}


val rdd = sc.parallelize(Seq(3, 4, 7, 1, 3, 0, 4, 6))

takeWhile(rdd)(_ != 0).collect
// Array[Int] = Array(3, 4, 7, 1, 3)

takeWhile(rdd)(_ != 3).collect
// Array[Int] = Array()

takeWhile(rdd)(_ >= 2).collect
// Array[Int] = Array(3, 4, 7)

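Note that this does not change the number of partitions, so unless you repartition the result, it may lead to suboptimal resource utilization.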
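One possible way to shrink the result, sketched under the assumption that an extra pass over the data is acceptable. The helper name `compact` is hypothetical, and `coalesce` without a shuffle is used here on the assumption that it merges neighbouring partitions, which is what typically happens for data without locality preferences:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: coalesce to roughly one partition per non-empty partition.
def compact[T](rdd: RDD[T]): RDD[T] = {
  // Count partitions that still hold at least one element (extra job).
  val nonEmpty = rdd
    .mapPartitions(iter => Iterator(if (iter.hasNext) 1 else 0))
    .fold(0)(_ + _)
  // coalesce requires a positive partition count.
  rdd.coalesce(math.max(nonEmpty, 1))
}
```

For example, `compact(takeWhile(rdd)(_ != 0))` would drop the trailing partitions that `takeWhile` emptied. Caching the input first avoids recomputing it for the counting job.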

At the end of the day, applying this kind of sequential logic in Spark rarely makes sense.
