Spark 1.6 stateful streaming job takes a very long time to recover after a restart

Time: 2018-07-19 05:48:19

Tags: apache-spark pyspark apache-spark-sql spark-streaming

I am trying to run a stateful streaming job with Spark 1.6 and Kafka. Below is the structure of my Spark job.

public class test {

    public static void main(final String[] args) {

        final String checkPointDir = "/test/chkDir";

        // Factory that builds a new context; invoked by getOrCreate only
        // when no checkpoint exists at checkPointDir
        Function0<JavaStreamingContext> createContextFun = new Function0<JavaStreamingContext>() {
            private static final long serialVersionUID = 1L;

            @Override
            public JavaStreamingContext call() throws Exception {
                return createContext(checkPointDir, args);
            }
        };

        // Recover the context from the checkpoint if one exists,
        // otherwise create it from scratch
        JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkPointDir, createContextFun);
        ssc.start();
        ssc.awaitTermination();
    }

    // batchInterval, kafkaParams, topics and crnStateMap are defined
    // elsewhere in the job and omitted here
    protected static JavaStreamingContext createContext(String checkPointDir, String[] vals) {
        // initialize spark configuration
        SparkConf sparkConfiguration = new SparkConf()
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");

        JavaSparkContext sparkContext = new JavaSparkContext(sparkConfiguration);
        JavaStreamingContext streamingSparkContext =
                new JavaStreamingContext(sparkContext, new Duration(batchInterval));

        final HiveContext hiveContext = new HiveContext(sparkContext);

        // Pull data from Kafka, creating a direct stream
        JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(
                streamingSparkContext, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaParams, topics);

        directKafkaStream.foreachRDD(---);   // elided

        // Apply business transformations on the stream data and join for enrichment
        JavaDStream<SMessage> enriched = directKafkaStream.map(--);   // elided

        JavaPairDStream<String, SMessage> pairSwitchMsg = enriched.mapToPair(--);   // elided

        Function3<String, Optional<SMessage>, State<dingState>, Tuple2<String, dingState>> mappingFunc = ---;   // elided

        JavaMapWithStateDStream<String, SMessage, dingState, Tuple2<String, dingState>> sWithState =
                pairSwitchMsg.mapWithState(StateSpec.function(mappingFunc).initialState(crnStateMap));

        streamingSparkContext.checkpoint(checkPointDir);
        return streamingSparkContext;
    }
}
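
For context, the elided mapping function follows Spark 1.6's standard mapWithState signature (note the Java API uses Guava's Optional). A minimal sketch of what it typically looks like; the update logic and the method on dingState here are hypothetical, SMessage and dingState are the question's own classes:

Function3<String, Optional<SMessage>, State<dingState>, Tuple2<String, dingState>> mappingFunc =
        new Function3<String, Optional<SMessage>, State<dingState>, Tuple2<String, dingState>>() {
            @Override
            public Tuple2<String, dingState> call(String key, Optional<SMessage> msg, State<dingState> state) {
                // Reuse the existing state for this key if present, otherwise start fresh
                dingState current = state.exists() ? state.get() : new dingState();
                if (msg.isPresent()) {
                    current.apply(msg.get());   // hypothetical update method on dingState
                }
                state.update(current);          // persist the updated state for this key
                return new Tuple2<>(key, current);
            }
        };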

The job runs fine on its first run, but after I kill it and start it again, it has now been running for more than 13 hours, still recomputing the checkpointed data, and is not consuming any new messages. Can someone help me understand whether I am missing something?
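
For reference, Spark 1.6 checkpoints mapWithState state every 10 batch intervals by default, and the interval can be set explicitly on the stateful stream, which affects how much lineage must be recomputed on recovery. A minimal sketch (the 100-second value is an arbitrary example, not a recommendation):

// Explicitly set how often the mapWithState state is checkpointed.
// The Spark 1.6 default is 10x the batch interval; 100 seconds is an
// arbitrary example value.
sWithState.checkpoint(Durations.seconds(100));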

0 Answers