Spark cannot serialize the task

Asked: 2015-06-28 22:28:12

Tags: java scala serialization

I have a transformation:

JavaRDD<Tuple2<String, Long>> mappedRdd = myRDD.values().map(
    new Function<Pageview, Tuple2<String, Long>>() {
      @Override
      public Tuple2<String, Long> call(Pageview pageview) throws Exception {
        String key = pageview.getUrl().toString();
        Long value = getDay(pageview.getTimestamp());
        return new Tuple2<>(key, value);
      }
    });

Pageview is this class: Pageview.java

Then I register that class with Spark:

Class[] c = new Class[1];
c[0] = Pageview.class;
sparkConf.registerKryoClasses(c);
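
For reference (my addition, not from the question): registerKryoClasses is shorthand for switching spark.serializer to Kryo and recording the classes to register. A roughly equivalent explicit configuration, assuming the same sparkConf instance:

// Sketch of what registerKryoClasses configures under the hood:
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.classesToRegister", Pageview.class.getName());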
  

The job then fails with the following exception:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
    at org.apache.spark.rdd.RDD.map(RDD.scala:286)
    at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:89)
    at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:46)
    at org.apache.gora.tutorial.log.ExampleSpark.run(ExampleSpark.java:100)
    at org.apache.gora.tutorial.log.ExampleSpark.main(ExampleSpark.java:53)
Caused by: java.io.NotSerializableException: org.apache.gora.tutorial.log.ExampleSpark
Serialization stack:
    - object not serializable (class: org.apache.gora.tutorial.log.ExampleSpark, value: org.apache.gora.tutorial.log.ExampleSpark@1a2b4497)
    - field (class: org.apache.gora.tutorial.log.ExampleSpark$1, name: this$0, type: class org.apache.gora.tutorial.log.ExampleSpark)
    - object (class org.apache.gora.tutorial.log.ExampleSpark$1, org.apache.gora.tutorial.log.ExampleSpark$1@4ab2775d)
    - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
    - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    ... 7 more

When I debug the code, I see that JavaSerializer.scala is invoked even though there is a class called KryoSerializer.
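
A note for readers (my explanation, not part of the original question): Kryo registration via registerKryoClasses only applies to the serialization of RDD data (shuffles and caching). The function object passed to map() goes through Spark's closure serializer, which in Spark 1.x supports only Java serialization; this is why JavaSerializationStream appears in the stack trace even with Kryo registered, and why the function and everything it captures must implement java.io.Serializable.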

PS 1: I don't want to use the Java serializer, but implementing Serializable on Pageview does not solve the problem either.

PS 2: This does not solve the problem either:

...
//String key = pageview.getUrl().toString();
//Long value = getDay(pageview.getTimestamp());
String key = "Dummy";
Long value = 1L;
return new Tuple2<>(key, value);
...
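
What the serialization stack above points to (my reading, not part of the original post): the anonymous Function is an inner class, so it carries a hidden this$0 reference to the enclosing ExampleSpark instance, and Spark has to serialize that instance no matter what the call body returns. A minimal sketch of the capture, with the class and method names taken from the stack trace:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

public class ExampleSpark {                 // not Serializable
  void run(JavaRDD<Pageview> rdd) {
    // The anonymous class below implicitly captures ExampleSpark.this
    // (the this$0 field in the trace), so serializing the closure drags
    // the whole non-serializable ExampleSpark instance along with it.
    rdd.map(new Function<Pageview, Long>() {
      @Override
      public Long call(Pageview p) throws Exception {
        return 1L;                          // a dummy body still fails
      }
    });
  }
}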

1 Answer:

Answer 0 (score: 4):

I have run into this problem many times with Java code. Although I was using Java serialization, I would either make the class that contains the function implement Serializable or, if you don't want to do that, make the function a static member of the class.

Here is a code snippet of the solution:

import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class Test {
  // A static field does not capture an enclosing instance, so Spark only
  // has to serialize the Function object itself, not the outer class.
  private static final Function<Pageview, Tuple2<String, Long>> s =
      new Function<Pageview, Tuple2<String, Long>>() {

        @Override
        public Tuple2<String, Long> call(Pageview pageview) throws Exception {
          String key = pageview.getUrl().toString();
          Long value = getDay(pageview.getTimestamp()); // getDay must be static too
          return new Tuple2<>(key, value);
        }
      };
}
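
The answer's other suggestion, sketched out (my illustration; the getDay helper and its signature are assumed from the question, not shown in the original): make the class that declares the anonymous function implement Serializable, so the captured enclosing instance can be serialized along with the closure. Every non-transient field of that class must then be serializable as well.

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class ExampleSpark implements Serializable {
  // With the enclosing class Serializable, the hidden this$0 reference
  // inside the anonymous Function can be serialized; every field of this
  // class must itself be serializable (or marked transient).
  JavaRDD<Tuple2<String, Long>> transform(JavaRDD<Pageview> pageviews) {
    return pageviews.map(new Function<Pageview, Tuple2<String, Long>>() {
      @Override
      public Tuple2<String, Long> call(Pageview pageview) throws Exception {
        return new Tuple2<>(pageview.getUrl().toString(),
                            getDay(pageview.getTimestamp()));
      }
    });
  }

  // Placeholder for the question's getDay helper (assumed signature).
  private static Long getDay(Long timestamp) {
    return timestamp / (24L * 60 * 60 * 1000);
  }
}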