How to check whether an RDD is empty using Spark Streaming?

Time: 2019-02-27 00:10:41

Tags: python-3.x apache-spark pyspark spark-streaming

I have the following PySpark code that reads log files from a logs/ directory and saves the result to a text file only when it contains data, in other words, only when the RDD is not empty. However, I am having trouble executing it. I have tried take(1) and notempty, but since this is a DStream rather than an RDD, RDD methods cannot be applied to it directly. Please let me know if I am missing something.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setMaster("local").setAppName("PysparkStreaming")
sc = SparkContext.getOrCreate(conf=conf)

ssc = StreamingContext(sc, 3)   # streaming batches execute every 3 seconds
lines = ssc.textFileStream('/Users/rocket/Downloads/logs/')  # 'logs/' is the monitored directory
audit = lines.map(lambda x: x.split('|')[3])
result = audit.countByValue()
#result.pprint()
#result.foreachRDD(lambda rdd: rdd.foreach(sendRecord))
# Print the first ten elements of each RDD generated in this DStream to the console
if result.foreachRDD(lambda rdd: rdd.take(1)):
    result.pprint()
    result.saveAsTextFiles("/Users/rocket/Downloads/output", "txt")
else:
    result.pprint()
    print("empty")

1 Answer:

Answer 0 (score: 0)

The correct structure should be:

import uuid

def process_batch(rdd):
    # foreachRDD hands this function a plain RDD for each batch,
    # so RDD methods such as isEmpty() can be used here
    if not rdd.isEmpty():
        # saveAsTextFile is the RDD method; the DStream method
        # saveAsTextFiles cannot be registered from inside a batch function
        rdd.saveAsTextFile("/Users/rocket/Downloads/output-{}".format(
            str(uuid.uuid4())
        ))

result.foreachRDD(process_batch)

However, as you can see above, because the RDD API has no append mode, each batch needs a separate output directory, which is why a fresh uuid goes into the path.
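
Note that the original if result.foreachRDD(...) condition can never succeed: foreachRDD merely registers the function as an output operation and returns None, so the condition is always falsy and the else branch always runs. The emptiness check has to live inside the function that receives each batch's RDD, as shown above. An illustrative line (hypothetical, only to make the return value visible):

print(result.foreachRDD(lambda rdd: rdd.take(1)))   # always prints None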

Another possibility is:
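One option, assuming the goal is to keep appending to a single output location, is to go through the DataFrame writer, which does support an append mode. A minimal sketch, using a SparkSession built on the same SparkContext; the column names value and count are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def process_batch(rdd):
    if not rdd.isEmpty():
        # countByValue() yields (value, count) pairs, hence two columns
        df = spark.createDataFrame(rdd, ["value", "count"])
        # append mode lets every batch share one output directory
        df.write.mode("append").format("csv").save("/Users/rocket/Downloads/output")

result.foreachRDD(process_batch)

Either way, remember to actually start the streaming context afterwards, which the code in the question omits:

ssc.start()             # begin processing batches
ssc.awaitTermination()  # keep the driver alive until the stream is stopped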