I have written a Spark job that reads from Kafka and commits offsets manually. It was working fine, but since I introduced broadcast variables I get a serialization exception, because Spark tries to serialize the KafkaInputDStream. Here is a minimal piece of code that shows the problem (the code is in Kotlin, but I believe it would also happen in Java):
fun testBroadCast(jsc: JavaStreamingContext, kafkaStream: JavaInputDStream<ConsumerRecord<String, SomeSerializableEvent>>) {
    // broadcast the key prefix so it can be reused inside RDD operations
    val keyPrefix = jsc.sparkContext().broadcast("EVENT:")
    kafkaStream.foreachRDD { rdd ->
        val offsetRanges = (rdd.rdd() as HasOffsetRanges).offsetRanges()
        // referencing keyPrefix.value here is what triggers the exception
        val prefixedIds = rdd.map { "${keyPrefix.value}:$it" }.collect()
        // commit the processed offsets back to Kafka
        (kafkaStream.dstream() as CanCommitOffsets).commitAsync(offsetRanges)
    }
}
fun main(args: Array<String>) {
    val jsc = JavaStreamingContext(SparkConf().setAppName("test simple prefixer").setMaster("local[*]"), Duration(5000))
    val stream = makeStreamFromSerializableEventTopic(jsc)
    testBroadCast(jsc, stream)
    jsc.start()
    jsc.awaitTermination()
}
If I remove keyPrefix and put the "EVENT:" literal directly in the map function, it works as expected (see the sketch after the stack trace below). Otherwise I get:
java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka010.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
    at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:525)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:512)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:512)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
    at org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:512)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.map(RDD.scala:369)
    at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:93)
    at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:45)
    at ir.pegahtech.tapsell.brain.engine.jobs.Test$testBroadCast$1.call(Test.kt:226)
    at ir.pegahtech.tapsell.brain.engine.jobs.Test$testBroadCast$1.call(Test.kt)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
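For comparison, here is a minimal sketch of the variant that does work for me: the same foreachRDD body, with the literal prefix inlined in place of the broadcast variable (the function name is changed only for illustration, nothing else differs):

fun testWithoutBroadCast(jsc: JavaStreamingContext, kafkaStream: JavaInputDStream<ConsumerRecord<String, SomeSerializableEvent>>) {
    kafkaStream.foreachRDD { rdd ->
        val offsetRanges = (rdd.rdd() as HasOffsetRanges).offsetRanges()
        // literal prefix instead of keyPrefix.value: no NotSerializableException
        val prefixedIds = rdd.map { "EVENT:$it" }.collect()
        (kafkaStream.dstream() as CanCommitOffsets).commitAsync(offsetRanges)
    }
}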
How is using or not using a broadcast variable related to serializing the KafkaInputDStream? The Spark version is 2.2.0.