My Kafka application reads real-time streaming data, processes it, and stores it into Hive. I am trying to use commitAsync to commit the offsets.
I am getting the following exception:
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka010.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
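In other words, the exception is raised when the messages DStream itself ends up inside a closure that Spark has to serialize; the fix the message asks for is to rewrite the work inside foreachRDD against the rdd argument instead of against the DStream. A minimal sketch of the two shapes (identifiers follow the code below; Java 8 lambdas used only for brevity):

// Shape the exception complains about: a DStream transformation
// ('messages.map') is created inside the foreachRDD closure.
messages.foreachRDD(rdd -> {
    messages.map(record -> record.value());   // DStream referenced inside the closure
});

// Shape the message asks for: transform the batch RDD that foreachRDD hands you.
messages.foreachRDD(rdd -> {
    JavaRDD<String> values = rdd.map(record -> record.value());
    // ... write 'values' out here ...
});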
Below is the workflow of my code:
public void method1(SparkConf conf, String app) {
    spark = SparkSession.builder().appName(conf.get("")).enableHiveSupport().getOrCreate();
    final JavaStreamingContext javaStreamContext = new JavaStreamingContext(context,
            new Duration(<spark duration>));

    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(javaStreamContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(<topicnames>, <kafka Params>));

    messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
        @Override
        public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

            // NOTE: the 'messages' DStream is referenced inside this foreachRDD closure,
            // which is what the NotSerializableException complains about.
            JavaDStream<String> records = messages.map(new Function<ConsumerRecord<String, String>, String>() {
                @Override
                public String call(ConsumerRecord<String, String> tuple2) throws Exception {
                    return tuple2.value();
                }
            });

            records.foreachRDD(new VoidFunction<JavaRDD<String>>() {
                @Override
                public void call(JavaRDD<String> rdd) throws Exception {
                    if (!rdd.isEmpty()) {
                        methodToSaveDataInHive(rdd, <StructTypeSchema>, <OtherParams>);
                    }
                }
            });

            ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
        }
    });

    javaStreamContext.start();
    javaStreamContext.awaitTermination();
}
Any suggestions are appreciated.
The following code works and commits the offsets after the data is processed, but the problem is that it processes duplicates in this scenario:

Say the consumer job is running and the Hive table has 0 records; the current offsets are (FORMAT: fromOffset, untilOffset, difference):
512 512 0

Then I produced 1000 records. After it had read 34 of them, but before it committed, I killed the job:
512 546 34

At this point I could see that those 34 records had already been loaded into the Hive table.

Next I restarted the application. I saw it read those 34 records again (instead of reading only the remaining 1000 - 34 = 966 records), even though it had already processed them and loaded them into Hive:
512 1512 1000

A few seconds later it updates:
1512 1512 0

Hive now has 34 + 1000 = 1034 records, so the table ends up with duplicates (the extra 34). As the code shows, I commit the offsets only after processing/loading into the Hive table.

Please advise.
public void method1(SparkConf conf, String app) {
    spark = SparkSession.builder().appName(conf.get("")).enableHiveSupport().getOrCreate();
    final JavaStreamingContext javaStreamContext = new JavaStreamingContext(context,
            new Duration(<spark duration>));

    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(javaStreamContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(<topicnames>, <kafka Params>));

    // Extract the record values and save them to Hive.
    JavaDStream<String> records = messages.map(new Function<ConsumerRecord<String, String>, String>() {
        @Override
        public String call(ConsumerRecord<String, String> tuple2) throws Exception {
            return tuple2.value();
        }
    });

    records.foreachRDD(new VoidFunction<JavaRDD<String>>() {
        @Override
        public void call(JavaRDD<String> rdd) throws Exception {
            if (!rdd.isEmpty()) {
                methodToSaveDataInHive(rdd, <StructTypeSchema>, <OtherParams>);
            }
        }
    });

    // Commit the offsets in a separate output operation and print the offset ranges.
    messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
        @Override
        public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
            for (OffsetRange offset : offsetRanges) {
                System.out.println(offset.fromOffset() + " " + offset.untilOffset() + " " + offset.count());
            }
        }
    });

    javaStreamContext.start();
    javaStreamContext.awaitTermination();
}
Answer 0 (score: 0)
Try moving ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges); outside the foreachRDD block:
public void method1(SparkConf conf, String app) {
    spark = SparkSession.builder().appName(conf.get("")).enableHiveSupport().getOrCreate();
    final JavaStreamingContext javaStreamContext = new JavaStreamingContext(context,
            new Duration(<spark duration>));

    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(javaStreamContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(<topicnames>, <kafka Params>));

    messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
        @Override
        public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

            JavaDStream<String> records = messages.map(new Function<ConsumerRecord<String, String>, String>() {
                @Override
                public String call(ConsumerRecord<String, String> tuple2) throws Exception {
                    return tuple2.value();
                }
            });

            records.foreachRDD(new VoidFunction<JavaRDD<String>>() {
                @Override
                public void call(JavaRDD<String> rdd) throws Exception {
                    if (!rdd.isEmpty()) {
                        methodToSaveDataInHive(rdd, <StructTypeSchema>, <OtherParams>);
                    }
                }
            });
        }
    });

    ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);

    javaStreamContext.start();
    javaStreamContext.awaitTermination();
}
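Note that, as written above, offsetRanges is declared inside the foreachRDD closure, so it is not visible at the top level where commitAsync has been moved. One way to keep both in the same scope, roughly following the pattern in the Spark streaming-kafka-0-10 integration guide, is to fetch the ranges and commit inside the same foreachRDD call (a sketch, reusing the identifiers above):

messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
        // Offset ranges are only available from the Kafka RDD inside foreachRDD.
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        // ... process 'rdd' (not the 'messages' DStream) here ...
        ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
    }
});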
Answer 1 (score: 0)
The following code works, but I am not sure whether it commits the offsets after the processing into Hive, because the commitAsync block comes before the Hive store method call.
public void method1(SparkConf conf, String app) {
    spark = SparkSession.builder().appName(conf.get("")).enableHiveSupport().getOrCreate();
    final JavaStreamingContext javaStreamContext = new JavaStreamingContext(context,
            new Duration(<spark duration>));

    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(javaStreamContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(<topicnames>, <kafka Params>));

    // Commit the offsets of each batch.
    messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
        @Override
        public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
        }
    });

    // Extract the record values and save them to Hive.
    JavaDStream<String> records = messages.map(new Function<ConsumerRecord<String, String>, String>() {
        @Override
        public String call(ConsumerRecord<String, String> tuple2) throws Exception {
            return tuple2.value();
        }
    });

    records.foreachRDD(new VoidFunction<JavaRDD<String>>() {
        @Override
        public void call(JavaRDD<String> rdd) throws Exception {
            if (!rdd.isEmpty()) {
                methodToSaveDataInHive(rdd, <StructTypeSchema>, <OtherParams>);
            }
        }
    });

    javaStreamContext.start();
    javaStreamContext.awaitTermination();
}
For this code, if I add the following block (just after offsetRanges is initialized) to print the offset details, it no longer works and throws the same exception:
messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

        rdd.foreachPartition(new VoidFunction<Iterator<ConsumerRecord<String, String>>>() {
            @Override
            public void call(Iterator<org.apache.kafka.clients.consumer.ConsumerRecord<String, String>> arg0) throws Exception {
                OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
                System.out.println(o.topic() + " " + o.partition() + " " + o.fromOffset() + " " + o.untilOffset());
            }
        });

        ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
    }
});
Please share your comments.
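A likely reason the print block brings the exception back: the foreachPartition closure is serialized and shipped to the executors, and because it is an anonymous class defined inside the outer anonymous class (which captures the messages DStream for the commitAsync call), the DStream gets dragged into that serialized closure. A sketch that keeps everything the executors see limited to the rdd, saves to Hive first, and only then commits the offsets, all within one foreachRDD, might look like the following (a lambda is used in rdd.map for brevity; methodToSaveDataInHive and the <...> placeholders are from the code above):

messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> rdd) {
        // Capture the offset ranges of this batch on the driver.
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

        // Transform the batch RDD directly; nothing here refers to the 'messages'
        // DStream, so only this small function is serialized to the executors.
        JavaRDD<String> values = rdd.map(record -> record.value());

        if (!values.isEmpty()) {
            methodToSaveDataInHive(values, <StructTypeSchema>, <OtherParams>);
        }

        // Driver-side logging of the offsets of this batch.
        for (OffsetRange o : offsetRanges) {
            System.out.println(o.topic() + " " + o.partition() + " "
                    + o.fromOffset() + " " + o.untilOffset());
        }

        // Commit only after the Hive write above has returned; commitAsync runs on
        // the driver, so referring to 'messages' here follows the documented pattern.
        ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
    }
});

This still gives at-least-once semantics only: if the job dies between the Hive write and the commit, the batch is replayed, so the write itself needs to be idempotent to fully avoid duplicates.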