Kafka exception when committing offsets with commitAsync

Time: 2019-01-26 23:41:45

Tags: java apache-spark apache-kafka spark-streaming

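My Kafka application reads real-time streaming data, processes it, and stores it in Hive. I am trying to commit offsets using commitAsync, and I get the following exception: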

  

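Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka010.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.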

Below is the workflow of my code:

public void method1(SparkConf conf, String app) {
    spark = SparkSession.builder().appName(conf.get("")).enableHiveSupport().getOrCreate();
    final JavaStreamingContext javaStreamContext = new JavaStreamingContext(context,
                new Duration(<spark duration>));
    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(javaStreamContext,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String> Subscribe(<topicnames>, <kafka Params>));
    messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
        @Override
        public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                JavaDStream<String> records = messages.map(new Function<ConsumerRecord<String, String>, String>() {
                    @Override
                    public String call(ConsumerRecord<String, String> tuple2) throws Exception {
                        return tuple2.value();
                    }
                });

                records.foreachRDD(new VoidFunction<JavaRDD<String>>() {
                    @Override
                    public void call(JavaRDD<String> rdd) throws Exception {
                        if(!rdd.isEmpty()) {
                            methodToSaveDataInHive(rdd, <StructTypeSchema>,<OtherParams>);
                        }
                    }
                 });
                ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
        }
    });
    javaStreamContext.start();
    javaStreamContext.awaitTermination();
}

Any suggestions are appreciated.


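The following code works and commits the offsets after the data is processed. The problem is that it processes duplicates in the following scenario. Say the consumer job is running, the Hive table has 0 records, and the current offsets are (format: fromOffset, untilOffset, difference):

512 512 0

I then produced 1000 records and killed the job after it had read 34 records but before it had committed:

512 546 34

At this point I can see that these 34 records have already been loaded into the Hive table.

Next, I restarted the application. It read those 34 records again (instead of reading only the remaining 1000 - 34 = 966 records), even though it had already processed them and loaded them into Hive:

512 1512 1000

A few seconds later it updated:

1512 1512 0

Hive now has 34 + 1000 = 1034 records.

This leaves duplicate records in the table (34 extra). As the code shows, I commit the offsets only after processing and loading into the Hive table.

Please advise.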

public void method1(SparkConf conf, String app) {
spark = SparkSession.builder().appName(conf.get("")).enableHiveSupport().getOrCreate();
final JavaStreamingContext javaStreamContext = new JavaStreamingContext(context,
            new Duration(<spark duration>));
JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(javaStreamContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String> Subscribe(<topicnames>, <kafka Params>));

            JavaDStream<String> records = messages.map(new Function<ConsumerRecord<String, String>, String>() {
                @Override
                public String call(ConsumerRecord<String, String> tuple2) throws Exception {
                    return tuple2.value();
                }
            });

            records.foreachRDD(new VoidFunction<JavaRDD<String>>() {
                @Override
                public void call(JavaRDD<String> rdd) throws Exception {
                    if(!rdd.isEmpty()) {
                        methodToSaveDataInHive(rdd, <StructTypeSchema>,<OtherParams>);
                    }
                }
             });

             messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
              @Override
              public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
                    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                    ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);                     
                    for (OffsetRange offset : offsetRanges) {
                        System.out.println(offset.fromOffset() + " " + offset.untilOffset()+ "  "+offset.count());
                    }
                     }
              });             
javaStreamContext.start();
javaStreamContext.awaitTermination();

}
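A note on the duplicates described above: commitAsync only queues the offsets for a later commit, and the Hive write and the Kafka commit are not one atomic step, so a job killed between the two replays the uncommitted range on restart. One common way to tolerate such replays is to make the write idempotent. The sketch below is plain Java with no Spark or Hive involved; `Sink`, `upsert`, and `processRange` are hypothetical names, and keying rows by partition and offset is an assumption about how the target table could be laid out, not something the original code does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdempotentSinkDemo {

    /** Hypothetical sink: rows are keyed by partition and offset,
     *  so replaying a record overwrites it instead of appending a copy. */
    static final class Sink {
        private final Map<String, String> rows = new LinkedHashMap<>();

        void upsert(int partition, long offset, String value) {
            rows.put(partition + ":" + offset, value);
        }

        int size() {
            return rows.size();
        }
    }

    /** Writes every record in [fromOffset, untilOffset) to the sink. */
    static void processRange(Sink sink, int partition, long fromOffset, long untilOffset) {
        for (long offset = fromOffset; offset < untilOffset; offset++) {
            sink.upsert(partition, offset, "record-" + offset);
        }
    }

    public static void main(String[] args) {
        Sink sink = new Sink();

        // First run: 34 records are written, then the job is killed before the commit.
        processRange(sink, 0, 512L, 546L);

        // Restart: the last committed offset is still 512, so the whole range is replayed.
        processRange(sink, 0, 512L, 1512L);

        // The 34 replayed records overwrote their earlier copies.
        System.out.println(sink.size()); // prints 1000
    }
}
```

With an append-only write, the same two runs would leave 1034 rows, which matches the count reported in the question.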

2 Answers:

Answer 0: (score: 0)

Try moving ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges); outside the foreachRDD block:

public void method1(SparkConf conf, String app) {
    spark = SparkSession.builder().appName(conf.get("")).enableHiveSupport().getOrCreate();
    final JavaStreamingContext javaStreamContext = new JavaStreamingContext(context,
                new Duration(<spark duration>));
    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(javaStreamContext,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String> Subscribe(<topicnames>, <kafka Params>));
    messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
        @Override
        public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                JavaDStream<String> records = messages.map(new Function<ConsumerRecord<String, String>, String>() {
                    @Override
                    public String call(ConsumerRecord<String, String> tuple2) throws Exception {
                        return tuple2.value();
                    }
                });

                records.foreachRDD(new VoidFunction<JavaRDD<String>>() {
                    @Override
                    public void call(JavaRDD<String> rdd) throws Exception {
                        if(!rdd.isEmpty()) {
                            methodToSaveDataInHive(rdd, <StructTypeSchema>,<OtherParams>);
                        }
                    }
                 });
        }
    });
     ((CanCommitOffsets)  messages.inputDStream()).commitAsync(offsetRanges);
    javaStreamContext.start();
    javaStreamContext.awaitTermination();
}

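Answer 1: (score: 0)

The following code works. However, I am not sure whether the offsets are committed after the data has been written to Hive, because the commitAsync block comes before the call to the Hive storage method.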

public void method1(SparkConf conf, String app) {
spark = SparkSession.builder().appName(conf.get("")).enableHiveSupport().getOrCreate();
final JavaStreamingContext javaStreamContext = new JavaStreamingContext(context,
            new Duration(<spark duration>));
JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(javaStreamContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String> Subscribe(<topicnames>, <kafka Params>));
messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
    }
});
            JavaDStream<String> records = messages.map(new Function<ConsumerRecord<String, String>, String>() {
                @Override
                public String call(ConsumerRecord<String, String> tuple2) throws Exception {
                    return tuple2.value();
                }
            });

            records.foreachRDD(new VoidFunction<JavaRDD<String>>() {
                @Override
                public void call(JavaRDD<String> rdd) throws Exception {
                    if(!rdd.isEmpty()) {
                        methodToSaveDataInHive(rdd, <StructTypeSchema>,<OtherParams>);
                    }
                }
             });

javaStreamContext.start();
javaStreamContext.awaitTermination();

}
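On the ordering doubt raised in this answer: the Spark Streaming programming guide states that output operations run one at a time, in the order they are defined in the application. If that applies here, a foreachRDD that calls commitAsync and is registered before the Hive foreachRDD would queue its offsets before the save runs for that batch. The sketch below only imitates that ordering rule in plain Java; `MiniStream`, `foreachBatch`, and `runBatch` are hypothetical stand-ins, not Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class OutputOrderDemo {

    /** Hypothetical miniature of a stream: for each batch it runs the
     *  registered output callbacks in the order they were registered. */
    static final class MiniStream {
        private final List<Consumer<String>> outputs = new ArrayList<>();

        void foreachBatch(Consumer<String> op) {
            outputs.add(op);
        }

        void runBatch(String batch) {
            for (Consumer<String> op : outputs) {
                op.accept(batch);
            }
        }
    }

    /** Commit callback registered before the save callback, as in the answer's code. */
    static List<String> commitRegisteredFirst() {
        List<String> log = new ArrayList<>();
        MiniStream stream = new MiniStream();
        stream.foreachBatch(batch -> log.add("commit " + batch));
        stream.foreachBatch(batch -> log.add("save " + batch));
        stream.runBatch("batch-1");
        return log;
    }

    /** Save and commit inside one callback, so the commit always follows the save. */
    static List<String> singleCallback() {
        List<String> log = new ArrayList<>();
        MiniStream stream = new MiniStream();
        stream.foreachBatch(batch -> {
            log.add("save " + batch);
            log.add("commit " + batch);
        });
        stream.runBatch("batch-1");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(commitRegisteredFirst()); // [commit batch-1, save batch-1]
        System.out.println(singleCallback());        // [save batch-1, commit batch-1]
    }
}
```

Doing the save and the commit in a single callback removes the dependence on registration order, which is why the second variant is usually the easier one to reason about.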

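With this code, if I add the following block (right after offsetRanges is initialized) to print the offset details, it no longer works and throws the same exception: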

messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
              @Override
              public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {


                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

               rdd.foreachPartition(new VoidFunction<Iterator<ConsumerRecord<String,String>>>() {
                   @Override
                   public void call(Iterator<org.apache.kafka.clients.consumer.ConsumerRecord<String,String>> arg0) throws Exception {

                   OffsetRange o = offsetRanges[TaskContext.get().partitionId()];

                   System.out.println(o.topic() + " " + o.partition() + " " + o.fromOffset() + " " + o.untilOffset());
                   }
            });

                ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);

              }
              });
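A possible explanation for why the block above fails: an anonymous inner class created inside an instance method keeps a hidden reference to its enclosing instance. Here the enclosing instance is the outer VoidFunction, which holds `messages` because it calls commitAsync on it, so serializing the foreachPartition function would pull the DStream along. A lambda captures only the variables it actually uses. The sketch below shows this Java behaviour without Spark; `Task`, `FakeStream`, and `OuterCallback` are hypothetical stand-ins for Spark's serializable function interfaces, the DStream, and the outer callback.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class ClosureCaptureDemo {

    /** Hypothetical stand-in for a serializable Spark function interface. */
    interface Task extends Serializable {
        String run();
    }

    /** Hypothetical stand-in for the DStream: deliberately not Serializable. */
    static class FakeStream { }

    /** Plays the role of the outer foreachRDD callback, which holds the stream. */
    static class OuterCallback implements Serializable {
        private final FakeStream stream = new FakeStream();

        /** Anonymous inner class: keeps a reference to this OuterCallback. */
        Task anonymousTask(String range) {
            return new Task() {
                @Override
                public String run() {
                    return range;
                }
            };
        }

        /** Lambda: captures only the local variable it uses. */
        Task lambdaTask(String range) {
            return () -> range;
        }
    }

    /** Returns true if Java serialization accepts the object. */
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        OuterCallback callback = new OuterCallback();
        System.out.println("anonymous class serializable: "
                + isSerializable(callback.anonymousTask("512-546"))); // false
        System.out.println("lambda serializable: "
                + isSerializable(callback.lambdaTask("512-546")));    // true
    }
}
```

If the same reasoning holds for the Spark code, writing the foreachPartition function as a lambda or as a static nested class that receives offsetRanges explicitly should keep `messages` out of the serialized closure.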

Please share your comments.