Reducing a potentially empty RDD

Date: 2015-12-10 20:45:02

Tags: scala apache-spark

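I've run into a problem where a filter I apply to an RDD can produce an empty RDD. Calling count() just to test for emptiness seems very expensive, so I'm wondering whether there is a more efficient way to handle this situation.

Here is an example of the problem: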

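    import org.apache.spark.rdd.RDD

    val b: RDD[String] = sc.parallelize(Seq("a", "ab", "abc"))

    // Every element contains "a", so the filter leaves nothing and reduce runs on an empty RDD
    println(b.filter(a => !a.contains("a")).reduce(_ + _))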

which gives the result

empty collection
java.lang.UnsupportedOperationException: empty collection
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1005)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1005)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1005)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)

Does anyone have suggestions on how to handle this edge case?

2 answers:

Answer 0 (score: 8)

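Consider .fold("")(_ + _) instead of .reduce(_ + _). Unlike reduce, fold starts from the supplied zero value, so an empty RDD returns "" rather than throwing.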
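To make that concrete, here is a minimal sketch of the fold approach applied to the example from the question. The object name `FoldExample`, the helper `concat`, the `local[2]` master and the application name are all made up for illustration. One caveat, as far as I understand Spark's fold: the zero value is applied once per partition and once more when merging, so it should be an identity for the operation (as "" is for string concatenation), and Spark's documentation warns that results for non-commutative operations may differ from a sequential fold.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FoldExample {
  // Hypothetical helper: concatenates all strings in the RDD.
  // An empty RDD yields the zero value "" instead of throwing.
  def concat(rdd: RDD[String]): String = rdd.fold("")(_ + _)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("fold-example")
    val sc = new SparkContext(conf)
    try {
      val b: RDD[String] = sc.parallelize(Seq("a", "ab", "abc"))
      // Every element contains "a", so the filtered RDD is empty.
      val filtered = b.filter(s => !s.contains("a"))
      // fold returns the empty string here, so no exception is raised.
      println(concat(filtered).isEmpty)
    } finally {
      sc.stop()
    }
  }
}
```

This only works when the operation has a natural identity element. If "no result" must be distinguished from an empty string, fold is not enough on its own.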

Answer 1 (score: 1)

How about

scala> val b = sc.parallelize(Seq("a","ab","abc"))
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> b.isEmpty
res1: Boolean = false
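A rough sketch of how isEmpty could guard the reduce from the question. The names `SafeReduce` and `reduceOption` are hypothetical, not part of Spark. As far as I know, isEmpty only needs to find a single element, so it should be much cheaper than count(), but the upstream filter is still evaluated for both the isEmpty check and the reduce unless the RDD is cached.

```scala
import org.apache.spark.rdd.RDD

object SafeReduce {
  // Hypothetical helper: returns None for an empty RDD and
  // Some(reduced value) otherwise, so reduce is never called on an empty RDD.
  def reduceOption[T](rdd: RDD[T])(op: (T, T) => T): Option[T] =
    if (rdd.isEmpty()) None else Some(rdd.reduce(op))
}
```

With the RDD from the question, this could be used as `SafeReduce.reduceOption(b.filter(a => !a.contains("a")))(_ + _).getOrElse("")`. Calling `.cache()` on the filtered RDD first may be worthwhile when the filter is expensive.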