java.io.NotSerializableException with Spark Streaming checkpointing enabled

Date: 2018-04-11 01:11:51

Tags: scala apache-spark spark-streaming

I enabled checkpointing in my Spark Streaming application and I am getting this error for a class that comes in as a downloaded dependency.

The application runs fine without checkpointing.
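
For context, checkpointing is set up roughly like this (a simplified sketch; the directory and app name are placeholders and the real stream wiring is omitted):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/checkpoint" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("live-records") // placeholder name
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... build liveRecordStream and the foreachRDD logic here ...
  ssc
}

// getOrCreate either recovers the DStream graph from the checkpoint or builds
// it fresh; checkpointing that graph is what forces everything referenced by
// the stream closures to be serializable.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)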

Error:

com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer
Serialization stack:
    - object not serializable (class: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer, value: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer@46c7c593)
    - field (class: com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector, name: _paranamer, type: interface com.fasterxml.jackson.module.paranamer.shaded.Paranamer)
    - object (class com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector, com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector@39d62e47)
    - field (class: com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, name: _secondary, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
    - object (class com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair@7a925ac4)
    - field (class: com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, name: _primary, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
    - object (class com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair@203b98cf)
    - field (class: com.fasterxml.jackson.databind.cfg.BaseSettings, name: _annotationIntrospector, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
    - object (class com.fasterxml.jackson.databind.cfg.BaseSettings, com.fasterxml.jackson.databind.cfg.BaseSettings@78c34153)
    - field (class: com.fasterxml.jackson.databind.cfg.MapperConfig, name: _base, type: class com.fasterxml.jackson.databind.cfg.BaseSettings)
    - object (class com.fasterxml.jackson.databind.DeserializationConfig, com.fasterxml.jackson.databind.DeserializationConfig@2df0a4c3)
    - field (class: com.fasterxml.jackson.databind.ObjectMapper, name: _deserializationConfig, type: class com.fasterxml.jackson.databind.DeserializationConfig)
    - object (class com.fasterxml.jackson.databind.ObjectMapper, com.fasterxml.jackson.databind.ObjectMapper@2db07651)

I am not sure how to make a class serializable when it comes from a Maven dependency. I am using v2.6.0 of jackson-core in my pom.xml; if I try a newer version of jackson-core, I get an incompatible Jackson version exception.

Code:

liveRecordStream
      .foreachRDD(newRDD => {
        if (!newRDD.isEmpty()) {
          val cacheRDD = newRDD.cache()
          val updTempTables = tempTableView(t2s, stgDFMap, cacheRDD)
          val rdd = updatestgDFMap(stgDFMap, cacheRDD)
          persistStgTable(stgDFMap)
          dfMap
            .filter(entry => updTempTables.contains(entry._2))
            .map(spark.sql)
            .foreach( df => writeToES(writer, df))

          cacheRDD.unpersist()
        }
      })

The issue happens only when there is a method call inside foreachRDD, e.g. tempTableView below:

def tempTableView(t2s: Map[String, StructType], stgDFMap: Map[String, DataFrame], cacheRDD: RDD[cacheRDD]): Set[String] = {
    stgDFMap.keys.filter { table =>
      val tRDD = cacheRDD
        .filter(r => r.Name == table)
        .map(r => r.values)
      val tDF = spark.createDataFrame(tRDD, tableNameToSchema(table))
      if (!tRDD.isEmpty()) {
        val tName = s"temp_$table"
        tDF.createOrReplaceTempView(tName)
      }
      !tRDD.isEmpty()
    }.toSet
  }
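
My guess is that calling an instance method from inside foreachRDD captures the whole enclosing object in the closure; a minimal sketch of that pattern (hypothetical names, not my actual classes):

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.rdd.RDD

class StreamJob { // hypothetical class
  val mapper = new ObjectMapper() // not Serializable

  def parse(s: String): String = mapper.readTree(s).toString

  def process(rdd: RDD[String]): Unit = {
    // `parse` is an instance method, so this closure references `this`,
    // dragging the non-serializable `mapper` into the serialized task.
    rdd.map(parse).foreach(println)
  }
}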

Any help is appreciated. I am not sure how to debug this and fix the issue.

2 Answers:

Answer 0 (score: 4):

From the code snippet you shared, I cannot see where the jackson library is being invoked. However, a NotSerializableException usually occurs when you try to send over the wire an object whose class does not implement the Serializable interface.

Spark is a distributed processing engine, and it works like this: there is one driver and multiple executors spread across nodes. Only the parts of the code that need to compute on the data are sent from the driver to the executors (over the wire). Spark transformations happen that way, i.e. across multiple nodes, so if you pass an instance of a class that does not implement the Serializable interface into such a code block (a block executed across nodes), it throws NotSerializableException.

For example:

import com.google.gson.Gson
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
   val gson: Gson = new Gson() // created on the driver

   val sparkConf = new SparkConf().setMaster("local[2]").setAppName("gson-example")
   val spark = SparkSession.builder().config(sparkConf).getOrCreate()
   val rdd = spark.sparkContext.parallelize(Seq("0","1"))

   // `gson` is referenced inside a transformation, so Spark tries to serialize it
   val something = rdd.map(str => {
     gson.toJson(str)
   })

   something.foreach(println)
   spark.close()
}

This code block throws NotSerializableException because we are sending an instance of Gson into a distributed function: map is a Spark transformation, so it runs on the executors. The following works:

import com.google.gson.Gson
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {

   val sparkConf = new SparkConf().setMaster("local[2]").setAppName("gson-example")
   val spark = SparkSession.builder().config(sparkConf).getOrCreate()
   val rdd = spark.sparkContext.parallelize(Seq("0","1"))

   val something = rdd.map(str => {
     // instantiated inside the transformation, i.e. on the executor
     val gson: Gson = new Gson()
     gson.toJson(str)
   })

   something.foreach(println)
   spark.close()
}

The above works because we instantiate Gson inside the transformation, so it gets instantiated on the executor itself. It is never sent over the wire from the driver, so there is nothing that needs to be serialized.
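
If constructing the object per record is too expensive, a common variant (a sketch along the same lines, not part of the original answer) is to create it once per partition with mapPartitions:

import com.google.gson.Gson
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
   val sparkConf = new SparkConf().setMaster("local[2]").setAppName("gson-example")
   val spark = SparkSession.builder().config(sparkConf).getOrCreate()
   val rdd = spark.sparkContext.parallelize(Seq("0","1"))

   val something = rdd.mapPartitions { iter =>
     val gson: Gson = new Gson() // one instance per partition, created on the executor
     iter.map(str => gson.toJson(str))
   }

   something.foreach(println)
   spark.close()
}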

Answer 1 (score: 0):

The problem was that the jackson ObjectMapper was getting serialized, and the ObjectMapper should not be serialized. Resolved the issue by adding @transient val objMapper = new ObjectMapper...
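
A minimal sketch of that pattern (hypothetical class name; note that combining @transient with lazy val is the usual way to get the field re-created on the executor after deserialization):

import com.fasterxml.jackson.databind.ObjectMapper

class EsWriter extends Serializable { // hypothetical class
  // Skipped during serialization, rebuilt lazily on first use after deserialization.
  @transient lazy val objMapper: ObjectMapper = new ObjectMapper()

  def toJsonString(value: AnyRef): String = objMapper.writeValueAsString(value)
}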