Spark Kafka Streaming commitAsync error

Date: 2018-03-18 04:25:39

Tags: scala apache-spark spark-streaming rdd scala-streams

I am new to Scala and to the RDD concept. I am using the Kafka streaming API in Spark to read messages from Kafka, and I try to commit the offsets after the business work is done, but I get the error below.

Note: repartition is used to parallelize the work.

How can I read the offsets from the streaming API and commit them to Kafka?


    scalaVersion := "2.11.8"
    val sparkVersion = "2.2.0"
    val connectorVersion = "2.0.7"
    val kafka_stream_version = "1.6.3"
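For reference, a minimal build.sbt sketch that wires these versions together (the spark-streaming-kafka-0-10 artifact is an assumption based on the KafkaUtils API used in the code below; it is not named in the original post):

    scalaVersion := "2.11.8"

    val sparkVersion = "2.2.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                 % sparkVersion,
      "org.apache.spark" %% "spark-streaming"            % sparkVersion,
      // Provides KafkaUtils.createDirectStream, HasOffsetRanges, CanCommitOffsets
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
    )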

Code

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
    ssc.checkpoint("C:/Gnana/cp")

    val kafkaStream = {
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "ignite3",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )

      val topics = Array("test")
      val numPartitionsOfInputTopic = 2

      // One direct stream per input partition, unioned into a single DStream
      val streams = (1 to numPartitionsOfInputTopic) map { _ =>
        KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams)
        ).map(_.value())
      }

      val unifiedStream = ssc.union(streams)
      val sparkProcessingParallelism = 1
      unifiedStream.repartition(sparkProcessingParallelism)
    }

    // Holds the latest offset ranges seen on the driver
    var offsetRanges = Array.empty[OffsetRange]

    // Finding offsetRanges
    kafkaStream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }

    // Do the business operation, then persist the offsets to Kafka
    kafkaStream.foreachRDD { rdd =>
      println("offsetRanges:" + offsetRanges)
      rdd.foreach { conRec =>
        println(conRec)
        kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    }

    println("Spark parallel reader is ready !!!")

    ssc.start()
    ssc.awaitTermination()
  }

Error

    java.io.NotSerializableException: Object of org.apache.spark.streaming.dstream.TransformedDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
        at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:525)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:512)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:512)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
        at org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:512)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)

1 Answer:

Answer 0 (score: 0)

Do not repartition before computing the offset ranges; if you do, you will run into this problem. To verify, simply remove the repartition and run the application again.
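Beyond the repartition issue, the stack trace also shows that kafkaStream (a DStream) is referenced inside rdd.foreach, so Spark tries to serialize the DStream into the task closure, which is exactly what the NotSerializableException complains about. Below is a minimal sketch of the commit pattern described in the spark-streaming-kafka-0-10 integration guide, reusing the ssc, topics and kafkaParams defined in the question: read the offset ranges from the original direct stream before any map/union/repartition, and call commitAsync on the driver once the work is done.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // Keep a handle on the original direct stream: only its RDDs implement HasOffsetRanges.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    stream.foreachRDD { rdd =>
      // Capture the offsets first, on the driver, before any map or shuffle
      // turns the KafkaRDD into a plain RDD without offset information.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Business work; repartitioning is safe here because the offsets
      // were already captured above.
      rdd.map(_.value()).repartition(1).foreach(println)

      // Commit on the driver. Calling this inside rdd.foreach would pull the
      // DStream into the task closure and reproduce the exception above.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

Since the offsets are captured before any shuffle, the business logic is free to repartition without breaking the commit.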