Reading Spark Streaming checkpoints after a failure

Asked: 2017-12-18 13:49:49

Tags: apache-kafka apache-spark-sql spark-streaming

I am trying to implement a fault-tolerant Spark Streaming application that consumes from Kafka. When I restart the application, it re-reads messages that were already consumed before the restart, and my computations come out wrong. Please help me solve this problem.

Here is the code, written in Java.

public static JavaStreamingContext createContextFunc() {

    SummaryOfTransactionsWithCheckpoints app = new SummaryOfTransactionsWithCheckpoints();

    ApplicationConf conf = new ApplicationConf();
    String checkpointDir = conf.getCheckpointDirectory();

    JavaStreamingContext streamingContext =  app.getStreamingContext(checkpointDir);

    // Any transformations and output operations on this stream must be defined
    // inside this factory function so they can be recovered from the checkpoint.
    JavaDStream<String> kafkaInputStream = app.getKafkaInputStream(streamingContext);

    return streamingContext;
}


public static void main(String[] args) throws InterruptedException {

    ApplicationConf conf = new ApplicationConf();
    String checkpointDir = conf.getCheckpointDirectory();

    Function0<JavaStreamingContext> createContextFunc = () -> createContextFunc();
    JavaStreamingContext streamingContext = JavaStreamingContext.getOrCreate(checkpointDir, createContextFunc);

    streamingContext.start();
    streamingContext.awaitTermination();

}

public JavaStreamingContext getStreamingContext(String checkpointDir) {

    ApplicationConf conf = new ApplicationConf();
    String appName = conf.getAppName();
    String master = conf.getMaster();
    int duration = conf.getDuration();

    SparkConf sparkConf = new SparkConf().setAppName(appName).setMaster(master);
    sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true");

    JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, new Duration(duration));
    streamingContext.checkpoint(checkpointDir);

    return streamingContext;
}

public SparkSession getSession() {

    ApplicationConf conf = new ApplicationConf();
    String appName = conf.getAppName();
    String hiveConf = conf.getHiveConf();
    String thriftConf =  conf.getThriftConf();
    int shufflePartitions = conf.getShuffle();

    SparkSession spark = SparkSession
            .builder()
            .appName(appName)
            .config("spark.sql.warehouse.dir", hiveConf)
            .config("hive.metastore.uris", thriftConf)
            .enableHiveSupport()
            .getOrCreate();

    spark.conf().set("spark.sql.shuffle.partitions", shufflePartitions);
    return spark;

}


public JavaDStream<String> getKafkaInputStream(JavaStreamingContext streamingContext) {

    KafkaConfig kafkaConfig = new KafkaConfig();
    Set<String> topicsSet = kafkaConfig.getTopicSet();
    Map<String, Object> kafkaParams = kafkaConfig.getKafkaParams();

    // Create direct kafka stream with brokers and topics
    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
            streamingContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.Subscribe(topicsSet, kafkaParams));

    JavaDStream<String> logdata = messages.map(ConsumerRecord::value);

    return logdata;
}

Here is the link to the GitHub project: https://github.com/ThisaST/Spark-Fault-Tolerance

1 Answer:

Answer 0 (score: 1)

I solved this issue by adding the following configuration to the code.

sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true");
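
For context, here is a minimal sketch of where that setting could go in the getStreamingContext method from the question. The ApplicationConf accessors and the checkpoint handling are taken from the question's code; the comments describe the general behaviour of these Spark settings rather than anything specific to this project.

public JavaStreamingContext getStreamingContext(String checkpointDir) {

    ApplicationConf conf = new ApplicationConf();

    SparkConf sparkConf = new SparkConf()
            .setAppName(conf.getAppName())
            .setMaster(conf.getMaster());

    // The write-ahead log only affects receiver-based input streams,
    // not the direct Kafka stream used here.
    sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true");

    // Finish in-flight batches and write the checkpoint before stopping,
    // so a restart resumes from the last completed batch.
    sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true");

    JavaStreamingContext streamingContext =
            new JavaStreamingContext(sparkConf, new Duration(conf.getDuration()));
    streamingContext.checkpoint(checkpointDir);

    return streamingContext;
}

The idea behind the graceful-shutdown flag is that batches still in flight at shutdown are otherwise re-run from the checkpoint on restart, which is why previously consumed messages appear again.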