I am using Spark 1.6.0 with Cloudera 5.8.3.
I have a DStream object with a lot of transformations defined on it:
val stream = KafkaUtils.createDirectStream[...](...)
val mappedStream = stream.transform { ... }.map { ... }
mappedStream.foreachRDD { ... }
mappedStream.foreachRDD { ... }
mappedStream.map { ... }.foreachRDD { ... }
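Is there a way to register one final foreachRDD that is guaranteed to run last, and only after all of the foreachRDD calls above have finished?
In other words, I want to run a lightweight function at the point where the Spark UI shows the job as completed.
Is there anything in the API that lets me do this?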
Thanks
Answer 0 (score: 5)
A streaming listener should solve this for you:
(Sorry, this is a Java example.)
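// Register the listener on the StreamingContext
ssc.addStreamingListener(new JobListener());
// ...
// StreamingListener and StreamingListenerBatchCompleted live in org.apache.spark.streaming.scheduler
class JobListener implements StreamingListener {
    // Called once per batch, after the jobs of that batch (every foreachRDD) have completed
    @Override
    public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
        System.out.println("Batch completed, Total delay :" + batchCompleted.batchInfo().totalDelay().get().toString() + " ms");
    }
    /*
    snipped other methods
    */
}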
Answer 1 (score: 1)
Start the stream with the name myStreamName and wait for it to start:
deltaStreamingQuery = (streamingDF
    .writeStream
    .format("delta")
    .queryName(myStreamName)
    .start(writePath)
)
untilStreamIsReady(myStreamName)
PySpark version of waiting for the stream to start:
import time

def getActiveStreams():
    try:
        return spark.streams.active
    except Exception:
        # Listing the active streams can occasionally fail; treat that as "no streams yet"
        print("Unable to iterate over all active streams - using an empty set instead.")
        return []

def untilStreamIsReady(name, progressions=3):
    # Poll until the named query is active and has reported enough progress updates
    queries = list(filter(lambda query: query.name == name, getActiveStreams()))
    while (len(queries) == 0 or len(queries[0].recentProgress) < progressions):
        time.sleep(5)  # Give it a couple of seconds
        queries = list(filter(lambda query: query.name == name, getActiveStreams()))
    print("The stream {} is active and ready.".format(name))
Spark Scala version of waiting for the stream to start:
def getActiveStreams(): Seq[org.apache.spark.sql.streaming.StreamingQuery] = {
  try {
    spark.streams.active
  } catch {
    case e: Throwable => {
      // In extreme cases, this function may throw an ignorable error.
      println("Unable to iterate over all active streams - using an empty set instead.")
      Seq[org.apache.spark.sql.streaming.StreamingQuery]()
    }
  }
}

def untilStreamIsReady(name: String, progressions: Int = 3): Unit = {
  // Poll until the named query is active and has reported enough progress updates
  var queries = getActiveStreams().filter(_.name == name)
  while (queries.length == 0 || queries(0).recentProgress.length < progressions) {
    Thread.sleep(5 * 1000) // Give it a couple of seconds
    queries = getActiveStreams().filter(_.name == name)
  }
  println("The stream %s is active and ready.".format(name))
}
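As for the original question: add another version of this function, so that you first wait for the stream to start and then wait a second time (just negate the wait condition) for it to finish. The complete version would then look like this: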
untilStreamIsReady(myStreamName)
untilStreamIsDone(myStreamName) // inverse of untilStreamIsReady: waits until myStreamName is no longer in the list of active streams
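The answer names untilStreamIsDone but does not show it. A minimal Python sketch of what it might look like follows, assuming "done" simply means that no active query carries the given name any more. The activeStreams, pollSeconds and timeoutSeconds parameters are illustrative additions, not part of the original answer. Passing the lookup in as a callable keeps the function independent of a live Spark session. Note that the call only returns if the query actually terminates, for example because it is stopped or fails.

```python
import time


def untilStreamIsDone(name, activeStreams, pollSeconds=5, timeoutSeconds=None):
    """Block until no active query is called `name`.

    `activeStreams` is a zero-argument callable returning the current list of
    active queries (hypothetical parameter, e.g. getActiveStreams from above).
    Returns True once the query has left the list, or False if the optional
    timeout elapses first.
    """
    deadline = None if timeoutSeconds is None else time.monotonic() + timeoutSeconds
    # Negated condition of untilStreamIsReady: keep waiting while the query is listed
    while any(query.name == name for query in activeStreams()):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(pollSeconds)
    return True
```

With the helper from this answer it could be called as untilStreamIsDone(myStreamName, getActiveStreams), followed by the lightweight function that should run last.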