I am trying to run a stateful streaming job with Spark 1.6 and Kafka. Below is the structure of my Spark job.
import com.google.common.base.Optional;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class test {
    public static void main(final String[] args) {
        final String checkPointDir = "/test/chkDir";
        Function0<JavaStreamingContext> createContextFun = new Function0<JavaStreamingContext>() {
            private static final long serialVersionUID = 1L;
            @Override
            public JavaStreamingContext call() throws Exception {
                return createContext(checkPointDir, args);
            }
        };
        JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkPointDir, createContextFun);
        ssc.start();
        ssc.awaitTermination();
    }

    protected static JavaStreamingContext createContext(String checkPointDir, String[] vals) {
        // initialize spark configuration
        SparkConf sparkConfiguration = new SparkConf().set("spark.streaming.receiver.writeAheadLog.enable", "true");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConfiguration);
        JavaStreamingContext streamingSparkContext = new JavaStreamingContext(sparkContext, new Duration(batchInterval));
        final HiveContext hiveContext = new HiveContext(sparkContext);

        // Pulling data from Kafka creating a Stream
        JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(streamingSparkContext,
                String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topics);
        directKafkaStream.foreachRDD(/* ... */);

        // Applying Business transformation on stream data and joining for enriching
        JavaDStream<SMessage> enriched = directKafkaStream.map(/* ... */);
        JavaPairDStream<String, SMessage> pairSwitchMsg = enriched.mapToPair(/* ... */);
        Function3<String, Optional<SMessage>, State<dingState>, Tuple2<String, dingState>> mappingFunc = /* ... */;
        JavaMapWithStateDStream<String, SMessage, dingState, Tuple2<String, dingState>> sWithState =
                pairSwitchMsg.mapWithState(StateSpec.function(mappingFunc).initialState(crnStateMap));

        streamingSparkContext.checkpoint(checkPointDir);
        return streamingSparkContext;
    }
}
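For context on the elided mappingFunc above: a `mapWithState` mapping function has the contract (key, new value, mutable per-key state) → emitted record. The sketch below shows that update pattern in plain Java, with no Spark dependency. It is only an illustration; `MappingFuncSketch`, the counter state, and the key names are hypothetical stand-ins for the job's actual `dingState` and `SMessage` types.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class MappingFuncSketch {
    // Hypothetical stand-in for the job's per-key state (dingState): a running count per key.
    static final Map<String, Integer> state = new HashMap<>();

    // Same shape as Spark's Function3<Key, Optional<Value>, State<S>, Mapped>:
    // read the existing state for the key, fold in the new value,
    // write the state back, and return the record to emit downstream.
    static int call(String key, Optional<Integer> value) {
        int current = state.getOrDefault(key, 0);
        int updated = current + value.orElse(0);
        state.put(key, updated);
        return updated;
    }

    public static void main(String[] args) {
        System.out.println(call("switch-1", Optional.of(2)));
        System.out.println(call("switch-1", Optional.of(3)));
    }
}
```

In the real job, the state read/write would go through `state.exists()`, `state.get()`, and `state.update(...)` on Spark's `State<dingState>` object instead of a map.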
The job runs fine on its first run, but if I kill it and start it again, it has now been running for more than 13 hours still recomputing the checkpointed data, so it does not consume any new messages. Can someone help me understand whether I am missing something?