Question

tweetStream.foreachRDD((rdd, time) => {
  val count = rdd.count()
  // Only write batches that contain at least one record
  if (count > 0) {
    val fileName = outputDirectory + "/tweets_" + time.milliseconds.toString
    val outputRDD = rdd.repartition(partitionsEachInterval)
    outputRDD.saveAsTextFile(fileName)
  }
})

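I am trying to check the count, or test for an empty RDD, in streaming data the Python way. I could not find how to do it, and I also tried the examples from the following link: http://spark.apache.org/docs/latest/streaming-programming-guide.html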

Answer 1

RDD.isEmpty:

Returns true if and only if the RDD contains no elements at all.

sc.range(0, 0).isEmpty()

True

sc.range(0, 1).isEmpty()

False
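
Applied to the streaming case in the question, the same check might look like the sketch below. It assumes the PySpark convention that a two-argument foreachRDD callback receives the batch time (as a datetime) first and the RDD second. The function name save_if_nonempty, its default arguments, and the tweet_stream variable are hypothetical and only illustrate the idea.

```python
from datetime import datetime


def save_if_nonempty(time, rdd, output_directory="/tmp/tweets", partitions=1):
    """Write one micro-batch as text files, skipping empty batches.

    Returns the output path, or None when the batch was empty.
    """
    # isEmpty() only has to look for a single element, so it is
    # usually cheaper than counting the whole batch with count().
    if rdd.isEmpty():
        return None
    # Assumption: `time` is a datetime; convert it to epoch milliseconds
    # to mirror the file naming used in the Scala snippet.
    millis = int(time.timestamp() * 1000)
    path = "{}/tweets_{}".format(output_directory, millis)
    rdd.repartition(partitions).saveAsTextFile(path)
    return path


# Hypothetical wiring, assuming tweet_stream is an existing DStream:
# tweet_stream.foreachRDD(save_if_nonempty)
```

Returning the path is not needed by Spark; it is only there to make the function easy to check outside a streaming job.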

Answer 2

Try the following code snippet.

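def process_rdd(rdd):
    # Handle a non-empty batch
    print(rdd.count())
    print("$$$$$$$$$$$$$$$$$$$$$$")
    # streamrdd_to_df is a helper that is not shown in this answer
    streamrdd_to_df(rdd)

def empty_rdd():
    print("###The current RDD is empty. Wait for the next complete RDD ###")

# Dispatch each batch depending on whether it contains any records
clean.foreachRDD(lambda rdd: empty_rdd() if rdd.count() == 0 else process_rdd(rdd))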

Answer 3

You can simply use RDD.isEmpty, as user6910411 suggested:

df.rdd.isEmpty()

It returns a boolean.
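
If converting the DataFrame to an RDD is undesirable, a possible alternative is to fetch at most one row and test whether anything came back. This is only a sketch: dataframe_is_empty is a hypothetical helper name, not part of PySpark, and it relies on DataFrame.head(n) returning a list of rows.

```python
def dataframe_is_empty(df):
    """Return True when the DataFrame has no rows."""
    # head(1) returns a list holding at most one Row,
    # so an empty list means the DataFrame is empty.
    return len(df.head(1)) == 0


# Hypothetical usage, assuming df is an existing DataFrame:
# if not dataframe_is_empty(df):
#     df.show()
```

Recent PySpark releases may also offer an isEmpty method directly on DataFrame, so it is worth checking the documentation for the version in use.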

How to check for an empty RDD in PySpark

3 answers: