Getting Task not serializable when trying to commit Kafka offsets in Spark with a broadcast variable

Date: 2018-01-17 06:44:56

Tags: java apache-spark kotlin

I have written a Spark job that reads from Kafka and commits offsets manually. It worked fine, but since I introduced a broadcast variable I get a serialization exception, because Spark tries to serialize the KafkaInputDStream. Here is a minimal piece of code that shows the problem (it is written in Kotlin, but I believe the same thing would happen in Java):

fun testBroadCast(jsc: JavaStreamingContext, kafkaStream: JavaInputDStream<ConsumerRecord<String, SomeSerializableEvent>>) {
    val keyPrefix = jsc.sparkContext().broadcast("EVENT:")
    kafkaStream.foreachRDD { rdd ->
        val offsetRanges = (rdd.rdd() as HasOffsetRanges).offsetRanges()
        // This map call is where the NotSerializableException below is thrown
        val prefixedIds = rdd.map { "${keyPrefix.value}:$it" }.collect()
        (kafkaStream.dstream() as CanCommitOffsets).commitAsync(offsetRanges)
    }
}

fun main(args: Array<String>) {
    val jsc = JavaStreamingContext(SparkConf().setAppName("test simple prefixer").setMaster("local[*]"), Duration(5000))
    val stream = makeStreamFromSerializableEventTopic(jsc)
    testBroadCast(jsc, stream)
    jsc.start()
    jsc.awaitTermination()
}
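
The makeStreamFromSerializableEventTopic helper is not shown here; for completeness, a minimal sketch of what such a helper could look like with the standard spark-streaming-kafka-0-10 API (broker address, topic name, group id, and the value deserializer are all assumptions, not part of the original post):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.api.java.JavaInputDStream
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies

fun makeStreamFromSerializableEventTopic(jsc: JavaStreamingContext): JavaInputDStream<ConsumerRecord<String, SomeSerializableEvent>> {
    val kafkaParams = mapOf<String, Any>(
        "bootstrap.servers" to "localhost:9092",                               // assumed broker address
        "key.deserializer" to StringDeserializer::class.java,
        "value.deserializer" to SomeSerializableEventDeserializer::class.java, // hypothetical custom deserializer
        "group.id" to "test-simple-prefixer",                                  // assumed group id
        "auto.offset.reset" to "earliest",
        "enable.auto.commit" to false                                          // offsets are committed manually via commitAsync
    )
    return KafkaUtils.createDirectStream(
        jsc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.Subscribe<String, SomeSerializableEvent>(listOf("serializable-event-topic"), kafkaParams)
    )
}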

If I remove keyPrefix and put the "EVENT:" literal directly in the map function, it works as expected (a sketch of that working variant follows the stack trace below). Otherwise I get:

  

java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka010.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
    at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:525)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:512)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:512)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
    at org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:512)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.map(RDD.scala:369)
    at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:93)
    at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:45)
    at ir.pegahtech.tapsell.brain.engine.jobs.Test$testBroadCast$1.call(Test.kt:226)
    at ir.pegahtech.tapsell.brain.engine.jobs.Test$testBroadCast$1.call(Test.kt)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
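
For comparison, this is roughly what the working variant described above looks like, reconstructed from that description (the name testNoBroadcast is just for illustration):

fun testNoBroadcast(kafkaStream: JavaInputDStream<ConsumerRecord<String, SomeSerializableEvent>>) {
    kafkaStream.foreachRDD { rdd ->
        val offsetRanges = (rdd.rdd() as HasOffsetRanges).offsetRanges()
        // With the literal inlined, the map closure captures no local variables,
        // so only the lambda itself has to be serialized and shipped to executors
        val prefixedIds = rdd.map { "EVENT:$it" }.collect()
        (kafkaStream.dstream() as CanCommitOffsets).commitAsync(offsetRanges)
    }
}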

How does using or not using a broadcast variable relate to serializing the KafkaInputDStream? The Spark version is 2.2.0.
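
A possible workaround, as an untested sketch: assuming the inner map lambda somehow drags in the enclosing scope that references kafkaStream, this version keeps every DStream reference outside the closure and hands the map lambda only a local copy of the Broadcast handle (the name testBroadCastFixed is hypothetical):

fun testBroadCastFixed(jsc: JavaStreamingContext, kafkaStream: JavaInputDStream<ConsumerRecord<String, SomeSerializableEvent>>) {
    val keyPrefix = jsc.sparkContext().broadcast("EVENT:")
    // Driver-side reference, resolved outside any RDD closure, so the lambdas
    // below never have to touch kafkaStream directly
    val canCommit = kafkaStream.dstream() as CanCommitOffsets
    kafkaStream.foreachRDD { rdd ->
        val offsetRanges = (rdd.rdd() as HasOffsetRanges).offsetRanges()
        // Local copy: the map closure should now capture only this
        // (serializable) Broadcast handle
        val prefix = keyPrefix
        val prefixedIds = rdd.map { "${prefix.value}:$it" }.collect()
        canCommit.commitAsync(offsetRanges)
    }
}

Whether this actually avoids the capture depends on how the Kotlin compiler compiles the nested lambdas, which is exactly what the question is about.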

0 Answers:

No answers yet.