How to pass Spark broadcast and accumulator variables to map and reduce functions

Asked: 2015-06-17 13:35:17

Tags: scala apache-spark

Consider the following code snippet:

class SparkJob extends Serializable {
  // Some code and other functions
  def launchJob = {
    val broadcastConfiguration = sc.broadcast(options) // options is some case class
    val accumulator = ... // create instance of accumulator
    inputFile.mapPartitions(lines => testMap(lines, broadcastConfiguration, accumulator)) // this line will throw a serialization error
  }
}

object SparkJob {
  // apply and other functions
  def testMap(lines: Iterator[String], broadcastConfiguration: ... /* other params */) = ... // function definition
}

How do I pass the accumulator and broadcastConfiguration instances through to the other functions?

I tried using just inputFile.mapPartitions(lines => testMap(lines)) and that works fine, so it looks to me as though passing the shared variables is the problem. How can I make this work?
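
To make the intended shape concrete, here is a minimal, self-contained sketch of what I am trying to express (Spark 1.x API). The Options case class, the Int accumulator, the input path and the body of testMap are simplified placeholders rather than my real code, and everything is kept local to main here, whereas in my real code launchJob lives on the SparkJob class above:

import org.apache.spark.{Accumulator, SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

case class Options(delimiter: String) // placeholder for my real options case class

object SparkJobSketch {
  // Shared variables are passed in as plain parameters; mapPartitions hands
  // the function an Iterator[String], not an Iterable[String].
  def testMap(lines: Iterator[String],
              broadcastConfiguration: Broadcast[Options],
              accumulator: Accumulator[Int]): Iterator[String] =
    lines.map { line =>
      if (line.isEmpty) accumulator += 1 // count empty lines (placeholder logic)
      line + broadcastConfiguration.value.delimiter
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkJobSketch").setMaster("local[2]"))
    val broadcastConfiguration = sc.broadcast(Options(";"))
    val accumulator = sc.accumulator(0)
    val inputFile = sc.textFile("input.txt") // placeholder path
    val mapped = inputFile.mapPartitions(lines =>
      testMap(lines, broadcastConfiguration, accumulator))
    println(mapped.count()) // an action, to force evaluation
    sc.stop()
  }
}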

Edit: added the exception trace

15/06/17 19:12:56 INFO SparkContext: Created broadcast 3 from textFile at SparkJob.scala:74
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:635)
    at com.auditude.databuild.steps.SparkJob.launchJob(SparkJob.scala:80)
    at com.auditude.databuild.steps.SparkJobDriver$.main(SparkJobDriver.scala:37)
    at com.auditude.databuild.steps.SparkJobDriver.main(SparkJobDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    ... 11 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
    at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
    ... 29 more

Edit 2: Added @transient as suggested, but it did not help. I even tried the approach below and still get the same exception.

val mapResult = inputFile.mapPartitions(lines => {
  println(broadcastConfiguration.value)
  lines
})
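
(For illustration only, since the exact placement is not shown above: @transient marks a field to be skipped by Java serialization, so the usual suggestion is to put it on non-serializable members such as the SparkContext reference, roughly like this:)

import org.apache.spark.SparkContext

// Simplified sketch, not the actual code: @transient tells Java serialization
// to skip the field, so the non-serializable SparkContext would not be
// dragged along when the enclosing instance is serialized with a closure.
class SparkJob(@transient val sc: SparkContext) extends Serializable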

Edit 3 - On further investigation I realized that, in simplifying my code, I had left out the detail that broadcastConfiguration is initialized in the class's constructor, so the actual code looks like this:

class SparkJob extends Serializable {
  // constructor
  val broadcastConfiguration = sc.broadcast(options) // options is some case class
  val accumulator = ... // create instance of accumulator

  // Some code and other functions
  def launchJob = {
    inputFile.mapPartitions(lines => testMap(lines, broadcastConfiguration, accumulator)) // this line will throw a serialization error
  }
}

I tried the same thing with a plain String and no broadcast variable at all: initialized in the constructor it still fails, but if I move the String's declaration into launchJob it works, at least for the String. Will report back in a further edit whether the broadcast variable behaves the same way. I would still like to know why this happens, given that my class is declared Serializable.
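
If the explanation is that reading a field inside the closure captures `this` (so the whole SparkJob instance, including any non-serializable members it references, has to be serialized), then a workaround consistent with the plain-String observation above would be to copy the fields into local vals before calling mapPartitions. A sketch of launchJob along those lines, which I have not verified yet:

def launchJob = {
  // Copy the fields into local vals: the closure below then captures only
  // these locals instead of `this`, so the SparkJob instance stays out of
  // the serialized task.
  val bc = broadcastConfiguration
  val acc = accumulator
  inputFile.mapPartitions(lines => SparkJob.testMap(lines, bc, acc))
}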

0 Answers
