java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext

Date: 2018-04-12 01:58:24

Tags: scala apache-spark spark-streaming

I am running into this error when trying to run a Spark Streaming application with checkpointing enabled.

java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
    - object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@63cf0da6)
    - field (class: com.sales.spark.job.streaming.SalesStream, name: streamingContext, type: class org.apache.spark.streaming.StreamingContext)
    - object (class com.sales.spark.job.streaming.SalesStreamFactory$$anon$1, com.sales.spark.job.streaming.SalesStreamFactory$$anon$1@1738d3b2)
    - field (class: com.sales.spark.job.streaming.SalesStream$$anonfun$runJob$1, name: $outer, type: class com.sales.spark.job.streaming.SalesStream)
    - object (class com.sales.spark.job.streaming.SalesStream$$anonfun$runJob$1, <function1>)

The error comes up when executing the code below. I suspect the problem has to do with accessing the spark session variable inside the tempTableView function.

Code:

liveRecordStream
  .foreachRDD(newRDD => {
    if (!newRDD.isEmpty()) {
      val cacheRDD = newRDD.cache()
      // register temp views for the tables that have records in this batch
      val updTempTables = tempTableView(t2s, stgDFMap, cacheRDD)
      val rdd = updatestgDFMap(stgDFMap, cacheRDD)
      persistStgTable(stgDFMap)
      // run the SQL for the tables whose temp views were refreshed and write each result to ES
      dfMap
        .filter(entry => updTempTables.contains(entry._2))
        .map(spark.sql)
        .foreach(df => writeToES(writer, df))
    }
  })
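For context, the Spark Streaming programming guide obtains the SparkSession inside foreachRDD from the RDD's own SparkContext rather than from a field of an enclosing class, so the closure does not need to reference the object that owns the StreamingContext. A minimal sketch of that pattern, reusing liveRecordStream from above with the body elided:

import org.apache.spark.sql.SparkSession

liveRecordStream.foreachRDD { newRDD =>
  if (!newRDD.isEmpty()) {
    // Recover (or lazily create) the session from the RDD's SparkContext so the
    // closure does not capture the class that holds the StreamingContext.
    val spark = SparkSession.builder.config(newRDD.sparkContext.getConf).getOrCreate()
    // ... createDataFrame / sql / writes go here ...
  }
}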

tempTableView

def tempTableView(t2s: Map[String, StructType], stgDFMap: Map[String, DataFrame], cacheRDD: RDD[cacheRDD]): Set[String] = {
  stgDFMap.keys.filter { table =>
    val tRDD = cacheRDD
      .filter(r => r.Name == table)
      .map(r => r.values)
    val nonEmpty = !tRDD.isEmpty()
    if (nonEmpty) {
      // `spark` refers to the session defined on the enclosing class,
      // which is the suspected source of the serialization problem
      val tDF = spark.createDataFrame(tRDD, tableNameToSchema(table))
      tDF.createOrReplaceTempView(s"temp_$table")
    }
    nonEmpty
  }.toSet
}

I am not sure how to get hold of the spark session variable inside this function, which is called from within foreachRDD.
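One commonly suggested direction (only a sketch, not verified against the full code) is to move helpers like tempTableView into a standalone object and pass the SparkSession in explicitly, so the checkpointed closure does not need a reference to the class that holds the StreamingContext. SalesStreamHelpers and the SalesRecord element type below are hypothetical stand-ins, and tableNameToSchema is passed as a plain Map:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical record type standing in for the element type of cacheRDD in the question.
case class SalesRecord(Name: String, values: Row)

// Hypothetical helper object: an `object` carries no reference to the class that
// holds the StreamingContext, so calling it from foreachRDD does not force that
// class into the checkpointed closure.
object SalesStreamHelpers {
  def tempTableView(spark: SparkSession,
                    stgDFMap: Map[String, DataFrame],
                    tableNameToSchema: Map[String, StructType],
                    cacheRDD: RDD[SalesRecord]): Set[String] = {
    stgDFMap.keys.filter { table =>
      val tRDD = cacheRDD.filter(_.Name == table).map(_.values)
      val nonEmpty = !tRDD.isEmpty()
      if (nonEmpty) {
        // build a DataFrame for this table's rows and expose it as a temp view
        spark.createDataFrame(tRDD, tableNameToSchema(table))
          .createOrReplaceTempView(s"temp_$table")
      }
      nonEmpty
    }.toSet
  }
}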

The streamingContext is instantiated as part of another class:

class Test {
  lazy val sparkSession: SparkSession =
    SparkSession
      .builder()
      .appName("testApp")
      .config("es.nodes", SalesConfig.elasticnode)
      .config("es.port", SalesConfig.elasticport)
      .config("spark.sql.parquet.filterPushdown", parquetFilterPushDown)
      .config("spark.debug.maxToStringFields", 100000)
      .config("spark.rdd.compress", rddCompress)
      .config("spark.task.maxFailures", 25)
      .config("spark.streaming.unpersist", streamingUnPersist)
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

  lazy val streamingContext: StreamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(15))
  streamingContext.checkpoint("/Users/gswaminathan/Guidewire/Java/explore-policy/checkpoint/")
}
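When checkpointing is enabled, the Spark Streaming guide recommends creating the StreamingContext through a factory function passed to StreamingContext.getOrCreate, so the context and its DStream graph can be rebuilt from the checkpoint on restart. A minimal sketch, with the checkpoint path and batch interval taken from the snippet above and everything else (the SalesStreamApp object, the elided DStream setup) assumed:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SalesStreamApp {
  val checkpointDir = "/Users/gswaminathan/Guidewire/Java/explore-policy/checkpoint/"

  // Factory used only when there is no existing checkpoint: builds the context
  // and defines the whole DStream graph inside it.
  def createContext(): StreamingContext = {
    val spark = SparkSession.builder().appName("testApp").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(15))
    ssc.checkpoint(checkpointDir)
    // ... define the DStreams and their foreachRDD logic here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Restores the context from the checkpoint if present, otherwise calls createContext()
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}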

I tried making this class extend Serializable, but no luck.
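For completeness, another direction that is often mentioned for this situation (again only a sketch, not verified against the code above) is to mark the non-serializable members as @transient so they are skipped when the enclosing instance is serialized, and recreated lazily after deserialization:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

class Test extends Serializable {
  // @transient tells serialization to skip these fields; as lazy vals they are
  // recomputed on first access after deserialization instead of being shipped.
  @transient lazy val sparkSession: SparkSession =
    SparkSession.builder().appName("testApp").getOrCreate()

  @transient lazy val streamingContext: StreamingContext =
    new StreamingContext(sparkSession.sparkContext, Seconds(15))
}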

0 answers:

No answers yet.