org.apache.spark.SparkException: Task not serializable, caused by java.io.NotSerializableException

Asked: 2019-10-27 09:31:13

Tags: scala apache-spark

I have two Scala codebases, MyMain.scala and MyFunction.scala, built separately; the jar built from MyFunction will act as a UDF inside MyMain.

MyFunction.scala basically contains a Java class with the public method public String myFunc(String val0, String val1). The project is built with SBT, and the build_jar compile output is stored as an artifact (only the required class, i.e. MyFunction.class, not its dependencies).
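The question does not show the body of the Java class packaged into MyFunction.jar, only its method signature. A minimal hypothetical stand-in (the body of myFunc is an assumption, chosen only to make the sketch runnable) could look like:

```java
// Hypothetical stand-in for the class packaged in MyFunction.jar.
// Only the signature public String myFunc(String, String) comes from
// the question; the body here simply joins its two inputs.
class MyFunction {
    public String myFunc(String val0, String val1) {
        return val0 + ":" + val1;
    }
}
```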

MyMain.scala imports the artifact jar above into a lib folder, which is added to the classpath via unmanagedBase := baseDirectory.value / "lib" in build.sbt.

So the MyMain.scala project structure looks like this:

MyMain
|- lib/MyFunction.jar
|    |- META-INF/MANIFEST.MF
|    |- MyFunction.class
|- project
|- src/main/scala/MyMain.scala
|- build.sbt

What I need to do

I want to define a UDF in MyMain.scala over the MyFunction.class inside MyFunction.jar, which has been added to the lib classpath. I have defined the UDF, but when I try to use it on a Spark dataframe inside MyMain.scala, it throws a Task not serializable java.io.NotSerializableException, as below:

org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:616)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
  at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:747)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:724)
  at MyMain$.main(<pastie>:253)
  ... 58 elided
Caused by: java.io.NotSerializableException: MyMain$
Serialization stack:
    - object not serializable (class: MyMain$, value: MyMain$@11f25cf)
    - field (class: $iw, name: MyMain$module, type: class MyMain$)
    - object (class $iw, $iw@540705e8)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@7e6e1038)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@7587f2a0)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5e00f263)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@3fbfe419)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5172e87b)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5ec96f75)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@26f6de78)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@18c3bc83)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@35d674ee)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5712092f)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6980c2e6)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6ce299e)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@406b8acb)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@73d71e61)
    - field (class: $line47.$read, name: $iw, type: class $iw)
    - object (class $line47.$read, $line47.$read@72ee2f87)
    - field (class: $iw, name: $line47$read, type: class $line47.$read)
    - object (class $iw, $iw@22c4de5a)
    - field (class: $iw, name: $outer, type: class $iw)
    - object (class $iw, $iw@3daea539)
    - field (class: $anonfun$1, name: $outer, type: class $iw)
    - object (class $anonfun$1, <function2>)
    - element of array (index: 9)
    - array (class [Ljava.lang.Object;, size 15)
    - field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11, name: references$1, type: class [Ljava.lang.Object;)
    - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11, <function2>)
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
  ... 92 more

What is the cause

MyMain.scala refers to a non-serializable instance of a class inside some transformation on a Spark dataframe.
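The failure mode can be reproduced with plain java.io serialization, outside Spark: a Serializable wrapper (standing in for the task closure) that holds a reference to a non-serializable object (standing in for MyMain$ in the serialization stack above) fails with exactly this exception. All class names here are illustrative:

```java
import java.io.*;

// Stand-in for MyMain$: a perfectly ordinary class that does NOT
// implement java.io.Serializable.
class NotSerializableHelper {
    String id(String s) { return s; }
}

// Stand-in for the Spark task closure: Serializable itself, but its
// field drags the non-serializable object into the serialized graph.
class CapturingTask implements Serializable {
    NotSerializableHelper helper = new NotSerializableHelper();

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);  // throws NotSerializableException here
        }
        return bos.toByteArray();
    }
}
```

Serializing a CapturingTask fails not because of the task class itself but because of the object it captures, which is the same shape as the `object not serializable (class: MyMain$, ...)` line in the stack above.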

What I have tried

object MyFunction extends Serializable {
  val myFuncSingleton = new MyFunction()
  def getMyFunc(var0:String,var1:String) : String = {
    myFuncSingleton.myFunc(var0,var1)
  }
}

import org.apache.spark.sql.functions.udf
val myUDF = udf((val0: String, val1: String) => { MyFunction.getMyFunc(val0, val1) })

object MyMain {
  val spark = ...
  val hadoopfs = ...
  def main(args: Array[String]) : Unit = {
    val df1 = ...
    val df2 = df1.withColumn("reg_id", myUDF(lit("Subscriber"), col("id")))
  }
}

See the following link: how-to-solve-non-serializable-errors-when-instantiating-objects-in-spark-udfs

1 answer:

Answer 0 (score: 0)

With some minor tweaks to the code, the problem was solved.

Although I do not fully understand the inner workings of the Scala compiler and how it handles UDFs, I will try to explain my solution and the probable cause of the Task not serializable error:

  1. The myUDF variable used in withColumn(...) is not inside any RDD closure.
  2. Calling getMyFunc(...) on the Scala object MyFunction in the udf(...) definition, outside the driver, is equivalent to calling a static method, so the MyFunction object does not need to be serialized: it is used as a singleton object, not as an instance of the MyFunction class (the one defined in MyFunction.jar). This explains the change of the definition from object MyFunction extends Serializable to object MyFunction.
  3. However, inside the "wrapper" singleton MyFunction object, myFuncSingleton is defined as an instance of the MyFunction class (from the jar), and myFuncSingleton.myFunc(...) invokes the myFunc(...) method on that instance.
  4. The myFuncSingleton instance, and through it the MyFunction class, is referenced in the driver via myUDF outside the RDD closure (as noted in 1), so the MyFunction class needs to be explicitly serializable, i.e. public class MyFunction implements java.io.Serializable (since it is a Java class built into the jar).
  5. Again as noted in 1, since the UDF call in withColumn(...) is outside the RDD closure, the MyMain object must be serialized for the UDF to work across partitions, i.e. object MyMain extends Serializable.

    object MyFunction {
      val myFuncSingleton = new MyFunction()
      def getMyFunc(var0:String,var1:String) : String = {
        myFuncSingleton.myFunc(var0,var1)
      }
    }
    
    import org.apache.spark.sql.functions.udf
    val myUDF = udf((val0: String, val1: String) => { MyFunction.getMyFunc(val0, val1) })
    
    object MyMain extends Serializable {
      val spark = ...
      val hadoopfs = ...
      def main(args: Array[String]) : Unit = {
        val df1 = ...
        val df2 = df1.withColumn("reg_id", myUDF(lit("Subscriber"), col("id")))
      }
    }
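The effect of the fix can also be illustrated with plain java.io serialization, independent of Spark: once every object in the captured graph implements Serializable, the round trip succeeds. The class and method names below are illustrative, not the author's code:

```java
import java.io.*;

// Stand-in for the fixed MyFunction class:
// public class MyFunction implements java.io.Serializable.
class FixedFunction implements Serializable {
    public String myFunc(String val0, String val1) {
        return val0 + ":" + val1;
    }
}

// Stand-in for the task closure after the fix: everything it captures
// is now serializable, so the whole graph can be shipped.
class FixedTask implements Serializable {
    FixedFunction func = new FixedFunction();

    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);  // succeeds: no non-serializable field
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return ois.readObject();
        }
    }
}
```

The round trip mirrors what Spark does when shipping the closure to executors: serialize on the driver, deserialize on each partition, then invoke the method on the reconstructed instance.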
    

Note:

  • To sum up, I am invoking a MyFunction instance method through a static-style method call on the MyFunction singleton object. So val myFuncVar = new MyFunction() would be a more appropriate name than val myFuncSingleton = new MyFunction().
  • I do not fully understand the nuances of RDD closures, and I am not sure whether withColumn() is outside the RDD closure, but I have assumed so for the purpose of this explanation.

There is some good explanation here: How Spark handles object