org.apache.spark.SparkException: Task not serializable - JavaSparkContext

Asked: 2015-06-08 12:54:31

Tags: java serialization apache-spark

I am trying to run the following simple piece of Spark code:

Gson gson = new Gson();
JavaRDD<String> stringRdd = jsc.textFile("src/main/resources/META-INF/data/supplier.json");

JavaRDD<SupplierDTO> rdd = stringRdd.map(new Function<String, SupplierDTO>()
{
    private static final long serialVersionUID = -78238876849074973L;

    @Override
    public SupplierDTO call(String str) throws Exception
    {
        return gson.fromJson(str, SupplierDTO.class);
    }
});

But when the stringRdd.map statement executes, it throws the following error:

org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
at org.apache.spark.rdd.RDD.map(RDD.scala:288)
at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78)
at org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:32)
at com.demo.spark.processor.cassandra.CassandraDataUploader.uploadData(CassandraDataUploader.java:71)
at com.demo.spark.processor.cassandra.CassandraDataUploader.main(CassandraDataUploader.java:47)
Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 7 more

Here 'jsc' is the JavaSparkContext object I am using. As far as I know, JavaSparkContext is not a Serializable object, and it should not be used inside any function that will be sent to the Spark workers.

Now, what I cannot understand is: how is the instance of JavaSparkContext being sent to the workers? What should I change in my code to avoid this?

3 Answers:

Answer 0 (score: 6)

The gson reference 'pulls' the outer class into the scope of the closure, dragging its complete object graph along with it.

In this case, create the gson object inside the closure:
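This capture can be reproduced with plain JDK serialization, no Spark required. The sketch below uses illustrative names (NonSerializableContext stands in for JavaSparkContext, MyFunc for Spark's Function): an anonymous inner class that reads an outer field keeps a hidden reference to its enclosing instance, so serializing the function also tries to serialize every field of that instance.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureDemo {

    // Stand-in for a non-serializable handle such as JavaSparkContext.
    static class NonSerializableContext { }

    // Serializable single-method interface, like Spark's Function.
    interface MyFunc extends Serializable {
        String call(String s);
    }

    static class Outer implements Serializable {
        NonSerializableContext ctx = new NonSerializableContext(); // like the jsc field
        String parserConfig = "strict";                            // like a gson field

        MyFunc makeCapturingFunction() {
            // Reads an outer field, so the anonymous class keeps a hidden
            // reference to the whole Outer instance -- ctx included.
            return new MyFunc() {
                @Override public String call(String s) { return parserConfig + ":" + s; }
            };
        }

        static MyFunc makeStandaloneFunction() {
            // Defined in a static context: no enclosing instance is captured.
            return new MyFunc() {
                @Override public String call(String s) { return "strict:" + s; }
            };
        }
    }

    // True if Java serialization accepts the object,
    // false when it hits a NotSerializableException.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new Outer().makeCapturingFunction())); // → false
        System.out.println(serializes(Outer.makeStandaloneFunction()));      // → true
    }
}
```

The capturing function fails exactly the way the question's closure does, while the function built in a static context serializes fine.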

public SupplierDTO call(String str) throws Exception {
   Gson gson = new Gson();
   return gson.fromJson(str, SupplierDTO.class);
}

You could also declare the Spark context transient.

If creating the Gson instance is expensive, consider using mapPartitions instead of map.
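The mapPartitions suggestion can be sketched without Spark: build the expensive object once per partition (a chunk of records, consumed through an Iterator, like mapPartitions hands to its call method) instead of once per record. The Parser class below is a hypothetical stand-in for Gson; a counter makes the difference visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class PerPartitionDemo {

    // Counts how many Parser objects get built.
    static final AtomicInteger constructions = new AtomicInteger();

    // Hypothetical stand-in for an expensive-to-build object such as Gson.
    static class Parser {
        Parser() { constructions.incrementAndGet(); }
        String parse(String record) { return record.trim(); }
    }

    // map-style: the closure builds a fresh Parser for every record.
    static List<String> mapStyle(List<String> records) {
        List<String> out = new ArrayList<>();
        for (String r : records) {
            out.add(new Parser().parse(r)); // one Parser per record
        }
        return out;
    }

    // mapPartitions-style: one Parser is built per partition and reused
    // for every record in it, like creating Gson once at the top of
    // mapPartitions' call(Iterator<String>) body.
    static List<String> mapPartitionsStyle(Iterator<String> partition) {
        Parser parser = new Parser(); // one Parser per partition
        List<String> out = new ArrayList<>();
        while (partition.hasNext()) {
            out.add(parser.parse(partition.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(" a ", " b ", " c ");
        mapStyle(records);
        int perRecord = constructions.getAndSet(0);
        mapPartitionsStyle(records.iterator());
        int perPartition = constructions.get();
        System.out.println(perRecord + " vs " + perPartition); // → 3 vs 1
    }
}
```

Per-record construction cost grows with the number of records; per-partition cost grows only with the number of partitions, which is why the answer recommends mapPartitions when the object is expensive.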

Answer 1 (score: 4)

In my case, I solved this problem with one of the following options:

  1. Declare the SparkContext as transient, as mentioned above
  2. You can also try making the gson object static: static Gson gson = new Gson();
  3. Refer to the documentation Job aborted due to stage failure: Task not serializable

    for other available options to solve this problem
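Option 1 can be sketched with plain JDK serialization (all names below are illustrative stand-ins): marking a non-serializable field transient tells the serializer to skip it, so the enclosing object serializes cleanly, and the field simply comes back as null after deserialization. That is why a transient SparkContext field no longer breaks closure serialization: the context stays on the driver and is never shipped.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDemo implements Serializable {

    // Stand-in for a non-serializable handle such as JavaSparkContext.
    static class Context { }

    transient Context ctx = new Context(); // skipped by serialization
    int taskId = 42;                       // serialized normally

    // Serialize to bytes and read the object back.
    static TransientDemo roundTrip(TransientDemo d) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(d);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return (TransientDemo) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TransientDemo copy = roundTrip(new TransientDemo());
        System.out.println(copy.ctx);    // → null (transient field not restored)
        System.out.println(copy.taskId); // → 42
    }
}
```

Without the transient keyword, the same round trip would fail with NotSerializableException on Context, exactly like the stack trace in the question.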

Answer 2 (score: 0)

Instead of line 9 (return gson.fromJson(str, SupplierDTO.class);), you can use the following code:

return new Gson().fromJson(str, SupplierDTO.class); // create the Gson instance inside the closure

And remove line 1 (Gson gson = new Gson();).