Question

RDDs have no isEmpty method, so what is the most efficient way to test whether an RDD is empty?

Answer 1

RDD.isEmpty() will be part of Spark 1.3.0.

Based on suggestions in this Apache mailing-list thread, and later on some comments to this answer, I ran some small local experiments. The best method is take(1).length == 0.

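import org.apache.spark.rdd.RDD

// The RDD is empty iff fetching at most one element returns nothing.
def isEmpty[T](rdd: RDD[T]): Boolean = {
  rdd.take(1).length == 0
}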

It should run in O(1), except when the RDD is empty, in which case it is linear in the number of partitions.
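The cost claim above can be illustrated with a deliberately simplified model. This is not Spark's implementation: partitions are modelled as plain Seqs, and scanForFirst is a hypothetical helper that inspects them one at a time and stops at the first element it finds. Spark's real take batches partitions differently, so the counts only show the shape of the argument.

```scala
object TakeOneModel {
  // Returns (isEmpty, number of partitions inspected) for a sequential scan
  // that stops at the first partition containing an element.
  def scanForFirst[T](partitions: Seq[Seq[T]]): (Boolean, Int) = {
    val idx = partitions.indexWhere(_.nonEmpty)
    if (idx >= 0) (false, idx + 1) else (true, partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val nonEmpty = Seq.tabulate(100)(i => Seq(i.toLong)) // 100 partitions, one element each
    val empty = Seq.fill(100)(Seq.empty[Long])           // 100 partitions, all empty
    println(scanForFirst(nonEmpty)) // (false,1)
    println(scanForFirst(empty))    // (true,100)
  }
}
```

In the non-empty case the scan stops after one partition regardless of data size; in the empty case every partition has to be looked at before the answer is known.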

Thanks to Josh Rosen and Nick Chammas for pointing this out.

Note: this fails if the RDD is of type RDD[Nothing], e.g. isEmpty(sc.parallelize(Seq())), but that is unlikely to be a problem in real life. isEmpty(sc.parallelize(Seq[Any]())) works fine.
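A small sketch of how to stay clear of RDD[Nothing]: give the empty collection an explicit element type, or build the empty RDD with sc.emptyRDD. The local master setting and the app name are arbitrary assumptions, and isEmpty is the helper from this answer, repeated so the snippet stands alone.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object EmptyRddTypes {
  // Helper from the answer: empty iff take(1) returns nothing.
  def isEmpty[T](rdd: RDD[T]): Boolean = rdd.take(1).length == 0

  def main(args: Array[String]): Unit = {
    // Local context for illustration only; master and app name are assumptions.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("empty-rdd-types"))
    try {
      // Seq.empty[Int] is a Seq[Int], so the result is an RDD[Int], not RDD[Nothing].
      val typed: RDD[Int] = sc.parallelize(Seq.empty[Int])
      println(isEmpty(typed))
      // emptyRDD takes the element type as an explicit type parameter.
      println(isEmpty(sc.emptyRDD[String]))
    } finally {
      sc.stop()
    }
  }
}
```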

Edits:

Edit 1: Added the take(1).length == 0 method, thanks to the comments.

My original suggestion: use mapPartitions.

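// Each partition emits one Boolean (true if it is empty); the results are AND-ed together.
def isEmpty[T](rdd: RDD[T]): Boolean = {
  rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_ && _)
}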

It scales with the number of partitions and is not nearly as clean as take(1). It does, however, work for RDDs of type RDD[Nothing].
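One possible variant of the mapPartitions approach, offered as a sketch and not as part of the original answer: reduce throws on an RDD that has no partitions at all (sc.emptyRDD, for example), whereas fold with an initial value of true returns true in that case. The name isEmptyByPartition and the local context settings are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionwiseEmptiness {
  // Each partition reports whether it is empty. fold starts from `true`,
  // so an RDD with zero partitions yields true instead of failing in reduce.
  def isEmptyByPartition[T](rdd: RDD[T]): Boolean =
    rdd.mapPartitions(it => Iterator(!it.hasNext)).fold(true)(_ && _)

  def main(args: Array[String]): Unit = {
    // Local context for illustration only; master and app name are assumptions.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitionwise-emptiness"))
    try {
      println(isEmptyByPartition(sc.emptyRDD[Int]))                          // zero partitions
      println(isEmptyByPartition(sc.parallelize(1 to 10, 4)))                // has data
      println(isEmptyByPartition(sc.parallelize(1 to 10, 4).filter(_ > 100))) // 4 empty partitions
    } finally {
      sc.stop()
    }
  }
}
```

Like the original, this still launches one task per partition, so it does not change the scaling behaviour described above.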

Experiments

I used this code for the timings.

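// Builds an RDD of 1..n in 100 partitions, applies f, and prints the elapsed
// wall-clock time in milliseconds. Assumes `sc` is in scope, as in spark-shell.
def time(n: Long, f: RDD[Long] => Boolean): Unit = {
  val start = System.currentTimeMillis()
  val rdd = sc.parallelize(1L to n, numSlices = 100)
  val result = f(rdd)
  println("Time: " + (System.currentTimeMillis() - start) + "   Result: " + result)
}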
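A single-shot measurement like the one above also picks up JVM warm-up and scheduling noise. As a hedged sketch, and not what the answer actually used, a helper could repeat the measurement and report the median. medianMillis is a hypothetical name, and the summation in main is only a stand-in workload.

```scala
object Timing {
  // Runs `body` `runs` times and returns the median elapsed time in milliseconds
  // (the upper middle sample when `runs` is even).
  def medianMillis[A](runs: Int)(body: => A): Double = {
    require(runs > 0, "runs must be positive")
    val samples = Seq.fill(runs) {
      val start = System.nanoTime()
      body // evaluated once per sample because `body` is by-name
      (System.nanoTime() - start) / 1e6
    }.sorted
    samples(runs / 2)
  }

  def main(args: Array[String]): Unit = {
    val ms = medianMillis(5) { (1L to 1000000L).sum }
    println(f"median: $ms%.3f ms")
  }
}
```

With a SparkContext in scope, the body could be any of the emptiness checks, for example medianMillis(5) { rdd.take(1).length == 0 }.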

time(1000000000L, rdd => rdd.take(1).length == 0L)
time(1000000000L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(1000000000L, rdd => rdd.count() == 0L)
time(1000000000L, rdd => rdd.takeSample(true, 1).isEmpty)
time(1000000000L, rdd => rdd.fold(0)(_ + _) == 0L)

time(1L, rdd => rdd.take(1).length == 0L)
time(1L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(1L, rdd => rdd.count() == 0L)
time(1L, rdd => rdd.takeSample(true, 1).isEmpty)
time(1L, rdd => rdd.fold(0)(_ + _) == 0L)

time(0L, rdd => rdd.take(1).length == 0L)
time(0L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(0L, rdd => rdd.count() == 0L)
time(0L, rdd => rdd.takeSample(true, 1).isEmpty)
time(0L, rdd => rdd.fold(0)(_ + _) == 0L)

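On my local machine with 3 worker cores I got these results (times in milliseconds; the three groups correspond to n = 1000000000, n = 1 and n = 0, in the order of the calls above):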

Time:    21   Result: false
Time:    75   Result: false
Time:  8664   Result: false
Time: 18266   Result: false
Time: 23836   Result: false

Time:   113   Result: false
Time:   101   Result: false
Time:    68   Result: false
Time:   221   Result: false
Time:    46   Result: false

Time:    79   Result: true
Time:    93   Result: true
Time:    79   Result: true
Time:   100   Result: true
Time:    64   Result: true
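To put the first group of numbers in proportion, here is a small worked calculation. The timings are copied from the n = 1000000000 run above; relativeTo and the method labels are hypothetical names introduced for this sketch. These are single runs on one machine, so the ratios are indicative only.

```scala
object Slowdown {
  // Timings in ms, copied from the n = 1000000000 run in the results above.
  val largeRun: Seq[(String, Long)] = Seq(
    "take(1)"       -> 21L,
    "mapPartitions" -> 75L,
    "count"         -> 8664L,
    "takeSample"    -> 18266L,
    "fold"          -> 23836L
  )

  // Expresses each timing as a multiple of the baseline method's timing.
  def relativeTo(baseline: String, timings: Seq[(String, Long)]): Seq[(String, Double)] = {
    val base = timings.toMap.apply(baseline).toDouble
    timings.map { case (name, ms) => name -> ms / base }
  }

  def main(args: Array[String]): Unit =
    relativeTo("take(1)", largeRun).foreach { case (name, ratio) =>
      println(f"$name%-14s $ratio%8.1fx")
    }
}
```

On a large non-empty RDD, count comes out roughly 400 times slower than take(1) in this run, while the differences in the n = 1 and n = 0 groups are within what a single measurement can distinguish.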

Answer 2

From Spark 1.3, isEmpty() is part of the RDD API. A bug that caused isEmpty() to fail was later fixed in Spark 1.4.

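For DataFrames you can do:

val df: DataFrame = ...
df.rdd.isEmpty()

The code in the RDD implementation (as of 1.4.1) comes down to the same check as in Answer 1, with a shortcut for RDDs that have no partitions: partitions.length == 0 || take(1).length == 0.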
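For the DataFrame case, a runnable sketch under stated assumptions: it uses SparkSession, which is from Spark 2.x and later and so newer than the versions discussed in this thread, and the helper names viaRdd and viaHead are made up for illustration. viaRdd is the check from this answer; viaHead is an alternative that fetches at most one row to the driver.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameEmptiness {
  // Goes through the underlying RDD, as in the answer.
  def viaRdd(df: DataFrame): Boolean = df.rdd.isEmpty()

  // Alternative: head(1) returns an Array[Row] with at most one row.
  def viaHead(df: DataFrame): Boolean = df.head(1).isEmpty

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("df-emptiness")
      .getOrCreate()
    try {
      val empty = spark.range(0).toDF("id")    // no rows
      val nonEmpty = spark.range(5).toDF("id") // rows 0..4
      println(s"empty:    viaRdd=${viaRdd(empty)} viaHead=${viaHead(empty)}")
      println(s"nonEmpty: viaRdd=${viaRdd(nonEmpty)} viaHead=${viaHead(nonEmpty)}")
    } finally {
      spark.stop()
    }
  }
}
```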
