Task not serializable - Java 1.8 and Spark 2.1.1

Time: 2018-08-18 09:51:44

Tags: java apache-spark

I am running into a problem with Java 8 and Spark 2.1.1.

I have a (valid) regular expression stored in a variable called "pattern". When I try to use this variable to filter the contents loaded from a text file, a SparkException: Task not serializable is thrown. Can anyone help me? Here is the code:

  JavaRDD<String> lines = sc.textFile(path);
  JavaRDD<String> filtered = lines.filter(new Function<String, Boolean>() {
        @Override
        public Boolean call(String v1) throws Exception {
            return v1.contains(pattern);
        }
    });

Here is the error stack trace:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:386)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.filter(RDD.scala:386)
at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)
at FileReader.filteredRDD(FileReader.java:47)
at FileReader.main(FileReader.java:68)

Caused by: java.io.NotSerializableException: FileReader
Serialization stack:
- object not serializable (class: FileReader, value: FileReader@6107165)
- field (class: FileReader$1, name: this$0, type: class FileReader)
- object (class FileReader$1, FileReader$1@7c447c76)
- field (class: org.apache.spark.api.java.JavaRDD$$anonfun$filter$1, name: f$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaRDD$$anonfun$filter$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)

2 answers:

Answer 0 (score: 1)

According to the non-serializable report that Spark generated:

 - object not serializable (class: FileReader, 
   value: FileReader@6107165)
 - field (class: FileReader$1, name: this$0, type: 
         class FileReader)
 - object (class FileReader$1, 
           FileReader$1@7c447c76)

the culprit is FileReader, the class in which the closure is defined, which is not serializable. This happens because Spark cannot serialize the method on its own: the anonymous inner class keeps an implicit reference to the enclosing instance (the this$0 field in the trace above), so when Spark cleans the closure it ends up trying to serialize the entire class.

In your code, I assume the variable pattern is a class field. That is what causes the problem: Spark cannot work out how to serialize pattern without serializing the whole class.

Try passing pattern to the closure as a local variable instead, as shown below, and this will work.
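A minimal sketch of that fix, assuming pattern is an instance field and reusing the sc and path variables from the question: copy the field into an effectively final local variable and reference only the copy inside the closure. A Java 8 lambda is used here instead of the anonymous inner class, because a lambda does not implicitly capture the enclosing this when no instance members are referenced.

    // Copy the field into a local variable; the closure now captures
    // only this String, not the enclosing FileReader instance.
    final String localPattern = pattern;

    JavaRDD<String> lines = sc.textFile(path);
    // The lambda avoids the implicit this$0 reference that the
    // anonymous inner class in the question carries.
    JavaRDD<String> filtered = lines.filter(line -> line.contains(localPattern));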

Answer 1 (score: 0)

Try making your class serializable.
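A minimal sketch of this second approach, assuming the FileReader class from the stack trace holds the sc and pattern variables seen in the question: implement java.io.Serializable and mark any field that cannot be serialized, such as the JavaSparkContext, as transient.

    import java.io.Serializable;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FileReader implements Serializable {
        private static final long serialVersionUID = 1L;

        // Non-serializable fields must be transient, otherwise
        // serialization fails on them instead of on the class.
        private transient JavaSparkContext sc;

        // A plain String field like pattern serializes fine.
        private String pattern;
    }

Note that this ships the whole FileReader instance to the executors, so the local-variable approach from the first answer is usually the lighter fix.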