I am using Spark's multiple-input-stream reader to consume messages from Kafka, and I get the error shown below. If I do not use multiple input stream readers, I do not get any error. I need the parallel approach for performance; for testing purposes I am using only one stream.

Error:
java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = test, partition = 0, offset = 120, CreateTime = -1, checksum = 2372777361, serialized key size = -1, serialized value size = 48, key = null, value = 10051,2018-03-15 17:12:24+0000,Bentonville,Gnana))
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/03/15 17:12:24 ERROR TaskSetManager: Task 0.0 in stage 470.0 (TID 470) had a not serializable result: org.apache.kafka.clients.consumer.ConsumerRecord
Code:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.Success
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object ParallelStreamJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkHelper.getOrCreateSparkSession()
    val ssc = new StreamingContext(spark.sparkContext, Milliseconds(50))

    val kafkaStream = {
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "use_a_separate_group_id_for_each_stream",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )
      val topics = Array("test")
      val numPartitionsOfInputTopic = 1
      val streams = (1 to numPartitionsOfInputTopic) map { _ =>
        KafkaUtils.createDirectStream[String, String](
          ssc,
          PreferConsistent,
          Subscribe[String, String](topics, kafkaParams)
        )
      }
      val unifiedStream = ssc.union(streams)
      val sparkProcessingParallelism = 1
      unifiedStream.repartition(sparkProcessingParallelism)
    }

    kafkaStream.foreachRDD(rdd => {
      rdd.foreach(conRec => {
        println(conRec.value())
      })
    })

    println(" Spark parallel reader is ready !!!")
    ssc.start()
    ssc.awaitTermination()
  }
}
SBT
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"
val kafka_stream_version = "1.6.3"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion,
  "com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion,
  "org.apache.kafka" %% "kafka" % "0.10.1.0",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
How can I fix this issue?
Answer 0 (score: 1)
The problem is clear: java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord. The ConsumerRecord class does not extend Serializable.

Try extracting the value field of the ConsumerRecord before the foreachRDD operation, e.g. kafkaStream.map(_.value()).

Update 1: the fix above does not work, because the exception occurs at ssc.union(streams). ssc.union(streams) needs to transfer data between nodes, so it has to serialize the records. You can therefore fix the problem by extracting the value field with a map operation before the union.
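A minimal sketch of that fix, reusing the names from the question (ssc, topics, kafkaParams, numPartitionsOfInputTopic, sparkProcessingParallelism); each direct stream is mapped to its String value first, so only serializable Strings cross the shuffle introduced by union and repartition:

// Sketch only: map each stream to its value before union, so no
// ConsumerRecord ever needs to be serialized across nodes.
val kafkaStream = {
  val streams = (1 to numPartitionsOfInputTopic) map { _ =>
    KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    ).map(_.value()) // extract the serializable String value here
  }
  ssc.union(streams).repartition(sparkProcessingParallelism)
}

kafkaStream.foreachRDD(rdd => {
  rdd.foreach(value => println(value))
})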
Answer 1 (score: 0)
First, if you have only one topic you should not create multiple Kafka streams. Since you are using the direct approach, it will automatically create as many tasks as there are Kafka partitions in the topic, and Spark will handle the parallelization of your work for you. Try using repartition() at the RDD level instead of repartitioning the DStream itself, as in the sketch below.
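A minimal sketch of that approach, assuming the same ssc, topics and kafkaParams as in the question; the repartition count of 4 is only an illustrative value:

// Sketch only: a single direct stream; the Kafka direct approach already
// creates one RDD partition per Kafka partition of the topic.
val singleStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

singleStream
  .map(_.value()) // extract the serializable String value
  .foreachRDD { rdd =>
    // repartition at the RDD level rather than on the DStream itself
    rdd.repartition(4).foreach(value => println(value))
  }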