In Spark Streaming, is there a way to detect when a batch has completed?

Asked: 2017-02-01 12:59:33

Tags: scala apache-spark spark-streaming cloudera

I'm using Spark 1.6.0 with Cloudera 5.8.3. I have a DStream object with a large number of transformations defined on it:

val stream = KafkaUtils.createDirectStream[...](...)
val mappedStream = stream.transform { ... }.map { ... }
mappedStream.foreachRDD { ... }
mappedStream.foreachRDD { ... }
mappedStream.map { ... }.foreachRDD { ... }

Is there a way to register a final foreachRDD that is guaranteed to run last, and only after all of the foreachRDD calls above have finished executing? In other words: when the Spark UI shows the job as completed, that is exactly when I want to run a lightweight function.

Is there anything in the API that would let me achieve this?

Thanks

2 Answers:

Answer 0 (score: 5)

Using a streaming listener should solve the problem for you:

(Sorry, this is a Java example.)

import org.apache.spark.streaming.scheduler.StreamingListener;
import org.apache.spark.streaming.scheduler.StreamingListenerBatchCompleted;

ssc.addStreamingListener(new JobListener());

// ...

class JobListener implements StreamingListener {

    @Override
    public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
        System.out.println("Batch completed, total delay: "
                + batchCompleted.batchInfo().totalDelay().get().toString() + " ms");
    }

    /* snipped other methods */
}
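
Since the question is tagged scala, here is a minimal sketch of the same listener in Scala (assuming ssc is your StreamingContext). In Scala the trait's other callbacks have default no-op implementations, so only the one you need has to be overridden:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  // Invoked on the driver once every job of the batch has finished.
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val delayMs = batchCompleted.batchInfo.totalDelay.getOrElse(-1L)
    println(s"Batch completed, total delay: $delayMs ms")
  }
})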

https://gist.github.com/akhld/b10dc491aad1a2007183

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-streaming/spark-streaming-streaminglisteners.html

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener

Answer 1 (score: 1)

Start a stream with the name myStreamName, and wait for it to start:

deltaStreamingQuery = (streamingDF
  .writeStream
  .format("delta")
  .queryName(myStreamName)
  .start(writePath)
)

untilStreamIsReady(myStreamName) 

PySpark version of waiting for the stream to start:

def getActiveStreams():
  try:
    return spark.streams.active
  except Exception:
    print("Unable to iterate over all active streams - using an empty set instead.")
    return []

def untilStreamIsReady(name, progressions=3):
  import time
  queries = list(filter(lambda query: query.name == name, getActiveStreams()))

  while (len(queries) == 0 or len(queries[0].recentProgress) < progressions):
    time.sleep(5)  # wait a few seconds between polls
    queries = list(filter(lambda query: query.name == name, getActiveStreams()))

  print("The stream {} is active and ready.".format(name))

Scala version of waiting for the stream to start:

def getActiveStreams(): Seq[org.apache.spark.sql.streaming.StreamingQuery] = {
  try {
    spark.streams.active
  } catch {
    case e: Throwable =>
      // In extreme cases, this function may throw an ignorable error.
      println("Unable to iterate over all active streams - using an empty set instead.")
      Seq.empty[org.apache.spark.sql.streaming.StreamingQuery]
  }
}

def untilStreamIsReady(name: String, progressions: Int = 3): Unit = {
  var queries = getActiveStreams().filter(_.name == name)

  while (queries.isEmpty || queries(0).recentProgress.length < progressions) {
    Thread.sleep(5 * 1000) // wait a few seconds between polls
    queries = getActiveStreams().filter(_.name == name)
  }
  println("The stream %s is active and ready.".format(name))
}

Back to the original question: add another version of this function that first waits for the stream to start, and then waits again for it to finish (just negate the waiting condition), so the full version would look like this:

untilStreamIsReady(myStreamName)
untilStreamIsDone(myStreamName)   // reverse of untilStreamIsReady - wait until myStreamName is no longer in the active list
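
The answer does not define untilStreamIsDone; a minimal Scala sketch, built on the getActiveStreams helper above by negating the wait condition, might look like this:

def untilStreamIsDone(name: String): Unit = {
  // Reverse of untilStreamIsReady: block while the named query is still active.
  var queries = getActiveStreams().filter(_.name == name)
  while (queries.nonEmpty) {
    Thread.sleep(5 * 1000) // poll every few seconds
    queries = getActiveStreams().filter(_.name == name)
  }
  println("The stream %s has finished.".format(name))
}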