I am using the code below to read from a Kafka topic and process the data.
JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
        .transform(new Function<JavaRDD<Row>, JavaRDD<Row>>() {
            //JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();
            StructType schema = DataTypes.createStructType(fields);

            public JavaRDD<Row> call(JavaRDD<Row> rdd) throws Exception {
                records = rdd.union(records);
                return rdd;
            }
        });

transformedMessages.foreachRDD(record -> {
    //System.out.println("Aman" +record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset ds = ss.createDataFrame(records, schema);
    ds.createOrReplaceTempView("trades");
    System.out.println(ds.count());
    ds.show();
});
When I run the code, I get the following exception:
Caused by: java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1624)
at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1197)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
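I have only one DStream, so I don't understand why I get this exception. I am reading from 3 partitions of a Kafka topic, and I assume that "createDirectStream" creates 3 consumers to read the data.
Below is the code of KafkaConsumer's acquire method: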
private void acquire() {
    this.ensureNotClosed();
    long threadId = Thread.currentThread().getId();
    if (threadId != this.currentThread.get() && !this.currentThread.compareAndSet(-1L, threadId)) {
        throw new ConcurrentModificationException("KafkaConsumer is not safe for multi-threaded access");
    } else {
        this.refcount.incrementAndGet();
    }
}
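To make the check in acquire concrete, here is a minimal stand-alone sketch that uses only the JDK. It copies the thread-id comparison and compare-and-set from the method above; the class name SingleThreadGuardDemo, the release method, and the two-thread harness are my own assumptions for illustration and are not taken from Kafka or Spark.

```java
import java.util.ConcurrentModificationException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SingleThreadGuardDemo {
    private static final long NO_OWNER = -1L;

    // Id of the thread that currently owns the guard, or NO_OWNER.
    private final AtomicLong currentThread = new AtomicLong(NO_OWNER);
    // Re-entrancy depth for the owning thread.
    private final AtomicInteger refcount = new AtomicInteger(0);

    // Same ownership check as the acquire() method quoted in the question.
    public void acquire() {
        long threadId = Thread.currentThread().getId();
        if (threadId != currentThread.get() && !currentThread.compareAndSet(NO_OWNER, threadId)) {
            throw new ConcurrentModificationException("guard is not safe for multi-threaded access");
        }
        refcount.incrementAndGet();
    }

    // Assumed counterpart: give up ownership once every acquire() has been released.
    public void release() {
        if (refcount.decrementAndGet() == 0) {
            currentThread.set(NO_OWNER);
        }
    }

    // Returns true if a second thread is rejected while the first still holds the guard.
    public static boolean secondThreadRejected() {
        SingleThreadGuardDemo guard = new SingleThreadGuardDemo();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);

        Thread owner = new Thread(() -> {
            guard.acquire();
            held.countDown();
            try {
                done.await(); // keep holding the guard until the main thread has tried
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                guard.release();
            }
        });
        owner.start();

        boolean rejected = false;
        try {
            held.await();    // wait until the owner thread holds the guard
            guard.acquire(); // a different thread now tries to enter
            guard.release();
        } catch (ConcurrentModificationException e) {
            rejected = true;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            done.countDown();
        }

        try {
            owner.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("second thread rejected: " + secondThreadRejected());
    }
}
```

The point of the sketch is that the exception does not need two consumer objects: it is thrown whenever a second thread touches the same consumer instance while another thread is still inside it.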
Answer 0 (score: 6)
Spark 2.2.0 has a workaround that disables the consumer cache.
Just set spark.streaming.kafka.consumer.cache.enabled to false.
See this pull request.
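As a rough illustration of that setting, the fragment below shows the property as an entry in spark-defaults.conf. Choosing that file is an assumption on my part; the same key and value can also be passed with --conf on spark-submit or set on the SparkConf before the streaming context is created.

```properties
# Disable the cached Kafka consumer used by the Kafka 0-10 direct stream (Spark 2.2.0+)
spark.streaming.kafka.consumer.cache.enabled  false
```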
Answer 1 (score: 0)
As described in this bug report: https://issues.apache.org/jira/browse/SPARK-19185, this is a known issue with Spark / Kafka.
In my case, I will avoid using windows and instead combine partitioning with batchInterval and blockInterval, as described here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
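Answer 2 (score: 0)
This is similar to the question "java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access": you have multiple threads using the same consumer, and the Kafka consumer does not support multi-threaded access. Also make sure you are not running with spark.speculation = true, because it can cause the error above.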
Answer 3 (score: 0)
In this code, you perform two actions on the RDD:
transformedMessages.foreachRDD(record -> {
    //System.out.println("Aman" +record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset ds = ss.createDataFrame(records, schema);
    ds.createOrReplaceTempView("trades");
    System.out.println(ds.count());
    ds.show();
});
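Two consumers from the same consumer group try to read the same Kafka topic partition, but Kafka allows only one consumer in a consumer group to read a given topic partition. The solution is to cache the RDD: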
transformedMessages.foreachRDD(record -> {
    //System.out.println("Aman" +record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset ds = ss.createDataFrame(records, schema);
    // Cache so that count() and show() do not each re-read from Kafka
    ds.cache();
    System.out.println(ds.count());
    ds.show();
});