I have two Scala codebases - MyMain.scala and MyFunction.scala - built separately; the jar built from MyFunction will act as a UDF inside MyMain.
MyFunction.scala basically contains a Java class with a public method public String myFunc(String val0, String val1). The project is built with SBT, and the compiled output is stored as an artifact (only the required class, i.e. MyFunction.class, not the dependencies).
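For reference, the Java class described above might look like the following sketch (the method body is an assumption; the post never shows it, so the concatenation is purely illustrative):

```java
// Sketch of the class packaged into MyFunction.jar, as it stood before
// the fix discussed in the answer (i.e. not yet Serializable).
// The real myFunc body is not shown in the post; this is a placeholder.
public class MyFunction {
    public String myFunc(String val0, String val1) {
        return val0 + "_" + val1; // placeholder logic
    }
}
```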
MyMain.scala imports the above artifact jar into a lib folder and adds it to the classpath with unmanagedBase := baseDirectory.value / "lib" in its build.sbt.
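A minimal build.sbt for MyMain matching this setup might look like the following sketch (the Spark coordinates and version are assumptions; the post does not state them, though the stack trace looks like Spark 2.4.x):

```scala
// build.sbt (sketch) - every jar in lib/ is picked up as an unmanaged dependency
unmanagedBase := baseDirectory.value / "lib"

// assumed coordinates and version, for illustration only
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
```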
So the MyMain project structure is as follows:
MyMain
|- lib/MyFunction.jar
|    |- META-INF/MANIFEST.MF
|    |- MyFunction.class
|- project
|- src/main/scala/MyMain.scala
|- build.sbt
/ What I need to do /
I want to define a UDF in MyMain.scala over MyFunction.class inside MyFunction.jar, which has been added to the classpath from lib. I have defined the UDF, but when I try to use it on a Spark dataframe inside MyMain.scala, it throws a "Task not serializable" java.io.NotSerializableException, as below:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:616)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:747)
at org.apache.spark.sql.Dataset.show(Dataset.scala:724)
at MyMain$.main(<pastie>:253)
... 58 elided
Caused by: java.io.NotSerializableException: MyMain$
Serialization stack:
- object not serializable (class: MyMain$, value: MyMain$@11f25cf)
- field (class: $iw, name: MyMain$module, type: class MyMain$)
- object (class $iw, $iw@540705e8)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@7e6e1038)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@7587f2a0)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@5e00f263)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@3fbfe419)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@5172e87b)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@5ec96f75)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@26f6de78)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@18c3bc83)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@35d674ee)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@5712092f)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@6980c2e6)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@6ce299e)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@406b8acb)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@73d71e61)
- field (class: $line47.$read, name: $iw, type: class $iw)
- object (class $line47.$read, $line47.$read@72ee2f87)
- field (class: $iw, name: $line47$read, type: class $line47.$read)
- object (class $iw, $iw@22c4de5a)
- field (class: $iw, name: $outer, type: class $iw)
- object (class $iw, $iw@3daea539)
- field (class: $anonfun$1, name: $outer, type: class $iw)
- object (class $anonfun$1, <function2>)
- element of array (index: 9)
- array (class [Ljava.lang.Object;, size 15)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 92 more
/ What is the cause /
MyMain.scala refers to some non-serializable instance of a class inside certain transformations on a Spark dataframe.
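This failure can be reproduced without Spark at all: before shipping a task, Spark's ClosureCleaner attempts plain Java serialization of the closure, so an ObjectOutputStream shows the same behaviour. The class and object names below are made up for the sketch:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a driver-side object (like MyMain$) that is
// not serializable but gets captured by a closure.
class NonSerializableHelper {
  def tag(s: String): String = "reg-" + s
}

object ClosureSerializationDemo {
  // Returns true if `f` survives Java serialization - essentially the
  // check Spark performs on a closure before shipping it to executors.
  def serializes(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val helper = new NonSerializableHelper

    // Capturing `helper` drags the whole non-serializable object into the
    // closure, reproducing the Task-not-serializable failure in miniature.
    val capturing: String => String = s => helper.tag(s)

    // A closure that depends on nothing outside itself serializes fine.
    val selfContained: String => String = s => "reg-" + s

    println(s"capturing serializes:      ${serializes(capturing)}")
    println(s"self-contained serializes: ${serializes(selfContained)}")
  }
}
```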
/ What I tried /
object MyFunction extends Serializable {
  val myFuncSingleton = new MyFunction()
  def getMyFunc(var0: String, var1: String): String = {
    myFuncSingleton.myFunc(var0, var1)
  }
}

import org.apache.spark.sql.functions.udf
val myUDF = udf((val0: String, val1: String) => { MyFunction.getMyFunc(val0, val1) })
object MyMain {
  val spark = ...
  val hadoopfs = ...
  def main(args: Array[String]): Unit = {
    val df1 = ...
    val df2 = df1.withColumn("reg_id", myUDF(lit("Subscriber"), col("id")))
  }
}
See the following link: how-to-solve-non-serializable-errors-when-instantiating-objects-in-spark-udfs
Answer 0 (score: 0)
Made minor adjustments to the code and it solved my issue.
Though I do not fully understand the inner workings of the Scala compiler and how it handles UDFs, I will try to explain my solution and the probable cause of the Task not serializable error:
1. The use of the myUDF variable in withColumn(...) is not within any RDD closure.
2. In the udf(...) definition, calling getMyFunc(...) on the Scala object MyFunction is equivalent to calling a static method, so the MyFunction object itself does not need to be serialized: it is used as a singleton object, not as an instance of the MyFunction class (defined in MyFunction.jar). This explains changing the MyFunction definition from object MyFunction extends Serializable to object MyFunction.
3. However, myFuncSingleton is defined as an instance of the MyFunction class (within the jar), and myFuncSingleton.myFunc(...) invokes the myFunc(...) method of this instance.
4. The myFuncSingleton instance, referenced in the driver through myUDF, lies outside any RDD closure (as mentioned in 1.), so the MyFunction class needs to be explicitly serializable, i.e. public class MyFunction implements java.io.Serializable (since the jar is built from a Java class).
5. Also as mentioned in 1., since the UDF call in withColumn(...) is outside the RDD closure, the MyMain object needs to be serialized for the UDF to be usable across partitions, i.e. object MyMain extends Serializable.
object MyFunction {
  val myFuncSingleton = new MyFunction()
  def getMyFunc(var0: String, var1: String): String = {
    myFuncSingleton.myFunc(var0, var1)
  }
}

import org.apache.spark.sql.functions.udf
val myUDF = udf((val0: String, val1: String) => { MyFunction.getMyFunc(val0, val1) })

object MyMain extends Serializable {
  val spark = ...
  val hadoopfs = ...
  def main(args: Array[String]): Unit = {
    val df1 = ...
    val df2 = df1.withColumn("reg_id", myUDF(lit("Subscriber"), col("id")))
  }
}
Note:
val myFuncVar = new MyFunction() would be a more appropriate name than val myFuncSingleton = new MyFunction(). There is some good explanation here: How Spark handles object
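On that last point about how Spark handles objects: an often-suggested alternative (a sketch, not part of the original answer) is to construct the helper inside the lambda body, so the closure captures nothing non-serializable at all. MyFunctionStub below is a made-up stand-in for the jar's class, with an assumed method body:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for the Java class in MyFunction.jar; deliberately NOT Serializable.
class MyFunctionStub {
  def myFunc(val0: String, val1: String): String = val0 + "_" + val1 // assumed behavior
}

object InsideClosureDemo {
  // The instance is created inside the lambda body on each call, so the
  // closure captures no outer state and can be shipped as-is, even though
  // MyFunctionStub itself is not Serializable.
  val body: (String, String) => String =
    (val0, val1) => new MyFunctionStub().myFunc(val0, val1)

  // Same serialization check Spark effectively performs on closures.
  def serializes(f: AnyRef): Boolean =
    try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f); true }
    catch { case _: NotSerializableException => false }
}
```

The trade-off is constructing a fresh instance per call; making the class Serializable, as the answer above does, avoids that cost.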