Scala Spark - Task not serializable

Date: 2015-09-18 21:04:26

Tags: scala apache-spark

I have the following code, where the error occurs at sc.parallelize():

val pairs = ret.cartesian(ret)
    .map {
        case ((k1, v1), (k2, v2)) => ((k1, k2), (v1.toList, v2.toList))
    }
for (pair <- pairs) {
    // fails here with "Task not serializable"
    val test = sc.parallelize(pair._2._1.map(_._1))
}

where

  • k1, k2 are Strings
  • v1, v2 are Lists of Doubles

Whenever I try to access sc I get the error below. What am I doing wrong here?


Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:869)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:868)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:868)
    at CorrelationCalc$.main(CorrelationCalc.scala:33)
    at CorrelationCalc.main(CorrelationCalc.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@40bee8c5)
    - field (class: CorrelationCalc$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
    - object (class CorrelationCalc$$anonfun$main$1, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
    ... 20 more

1 Answer:

Answer 0 (score: 2):

The for-comprehension is just doing pairs.map().

RDD operations are performed by the workers, and to have them do that work, anything you send to them must be serializable. The SparkContext, on the other hand, is attached to the master: it is responsible for managing the entire cluster.
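As a minimal illustration (a sketch, not code from the question; the object and variable names here are made up), a closure that captures an ordinary serializable value is fine, while a closure that captures the SparkContext is exactly what triggers this exception:

import org.apache.spark.{SparkConf, SparkContext}

object ClosureExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-example").setMaster("local[*]"))
    val factor = 2.0                                  // an ordinary serializable value
    val nums = sc.parallelize(Seq(1.0, 2.0, 3.0))

    // OK: the closure only captures `factor`, which can be shipped to the workers.
    println(nums.map(x => x * factor).collect().mkString(", "))

    // NOT OK: this closure would capture `sc` itself, and SparkContext is not
    // serializable -- it exists only on the driver.
    // nums.foreach(x => sc.parallelize(Seq(x)))

    sc.stop()
  }
}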

If you want to create an RDD, you have to be aware of the whole cluster (that's the second "D", for "Distributed"), so you cannot create a new RDD on the workers. And you probably don't want to turn each row of pairs into an RDD (each with the same name!) anyway.
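If one of those lists really did have to become its own RDD, the only place that can happen is on the driver, for example after a collect() — a sketch that assumes pairs is small enough to fit in driver memory and reuses the shapes from the question:

// Only viable when `pairs` is small enough to collect to the driver.
val collected = pairs.collect()
for (((k1, k2), (v1, v2)) <- collected) {
  val test = sc.parallelize(v1.map(_._1))  // runs on the driver, where sc lives
  // ... use `test` ...
}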

It's hard to tell from your code what you actually want to do, but more likely it will look something like

val test = pairs.map(r => r._2._1)

This would be an RDD where each row contains whatever was in v1.toList.
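If the keys are still needed, or if a single flat RDD of the inner values is the real goal, both can be had without nesting RDDs — again a sketch, assuming the shapes described in the question:

// Keep the key pair next to its first list of values.
val withKeys = pairs.map { case ((k1, k2), (v1, v2)) => ((k1, k2), v1) }

// Or flatten all of the inner values into one RDD instead of one RDD per pair.
val flattened = pairs.flatMap { case (_, (v1, _)) => v1.map(_._1) }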