Consider the following code snippet:
class SparkJob extends Serializable {
  // some fields and other functions

  def launchJob = {
    val broadcastConfiguration = sc.broadcast(options) // options is some case class
    val accumulator = ... // create an accumulator instance
    // this line throws a serialization error
    inputFile.mapPartitions(lines => testMap(lines, broadcastConfiguration, accumulator))
  }
}
object SparkJob {
  // apply and other functions

  def testMap(lines: Iterator[String], broadcastConfiguration: ... /* other params */) = ... // function definition
}
How can I pass the accumulator and broadcastConfiguration instances along with the other parameters? I tried using just inputFile.mapPartitions(lines => testMap(lines)) and it works fine, so it seems the shared variables are the problem when they are passed. How can I make this work?

Edit: added the exception trace
15/06/17 19:12:56 INFO SparkContext: Created broadcast 3 from textFile at SparkJob.scala:74
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:635)
at com.auditude.databuild.steps.SparkJob.launchJob(SparkJob.scala:80)
at com.auditude.databuild.steps.SparkJobDriver$.main(SparkJobDriver.scala:37)
at com.auditude.databuild.steps.SparkJobDriver.main(SparkJobDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 11 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
... 29 more
Edit 2: Added @transient as suggested, but it did not help. I even tried the approach below and still get the same error.
val mapResult = inputFile.mapPartitions(lines => {
  println(broadcastConfiguration.value)
  lines
})
Edit 3: On further investigation I realized that, in simplifying my code, I omitted the detail that broadcastConfiguration is initialized in the class constructor, so the actual code looks like this:
class SparkJob extends Serializable {
  // constructor
  val broadcastConfiguration = sc.broadcast(options) // options is some case class
  val accumulator = ... // create an accumulator instance

  // some code and other functions

  def launchJob = {
    // this line throws a serialization error
    inputFile.mapPartitions(lines => testMap(lines, broadcastConfiguration, accumulator))
  }
}
I tried using a plain String, initialized in the constructor and without any broadcast variable, and it still fails. If I move the declaration of that string into launchJob, it works, at least for the string. I will report back in a further edit whether the broadcast variable behaves the same way. I would still like to know why this happens, since my class is declared Serializable.
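A likely explanation: when a closure passed to mapPartitions references a field of the enclosing class, Scala compiles that reference as `this.field`, so the closure captures the whole instance, and Spark's ClosureCleaner then tries to Java-serialize it, including fields like SparkContext that are not serializable, even though the class itself is marked Serializable. Copying the field into a local val inside the method means the closure captures only that val. The following is a minimal, Spark-free sketch of this effect; the names `Outer`, `NonSerializableField`, and `canSerialize` are hypothetical stand-ins, not code from the question.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stands in for a non-serializable field such as SparkContext
class NonSerializableField

class Outer extends Serializable { // marked Serializable, like SparkJob
  val sc = new NonSerializableField // but holds a non-serializable field
  val label = "hello"

  // Referencing `label` compiles to `this.label`, so the closure captures
  // the whole Outer instance; serialization then fails on `sc`.
  def capturingClosure: String => String = s => s + label

  // Copying the field into a local val first means only the String is captured.
  def localCopyClosure: String => String = {
    val localLabel = label
    s => s + localLabel
  }
}

// True if Java serialization (what Spark's ClosureCleaner checks) succeeds.
def canSerialize(f: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f)
    true
  } catch {
    case _: NotSerializableException => false
  }
```

By the same logic, assigning `val bc = broadcastConfiguration` and `val acc = accumulator` as locals inside launchJob before calling mapPartitions should keep the enclosing SparkJob instance out of the closure.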