调用JavaPairRDD.max时Spark中的Task Not Serializable异常

时间:2016-02-14 06:58:22

标签: java serialization apache-spark

在ItelliJ上运行时,我遇到了一些例外:线程中的异常" main" org.apache.spark.SparkException:任务不可序列化 代码段:

`

public class MostPopularSuperHero {

public static void main(String args[]) {
    SparkConf conf = new SparkConf().setAppName("MostPopularSuperHero").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    class HrDict {
        Map<Integer, String> getHeroDict() {
            Map<Integer, String> heroDict = new HashMap<>();
            BufferedReader br = null;

            try {

                String sCurrentLine;

                br = new BufferedReader(new FileReader("/Users/11130/udemy/SparkCourse/Marvel-Names.txt"));

                while ((sCurrentLine = br.readLine()) != null) {
                    String str = sCurrentLine;
                    String[] fields = str.split(" ", 2);
                    heroDict.put(Integer.parseInt(fields[0]), fields[1]);
                }

            } catch (IOException e) {
                e.printStackTrace();
            }

            return heroDict;
        }
    }

    class DummyComparator implements Comparator<Tuple2<Integer, String> > {
        @Override
        public int compare(Tuple2<Integer, String> o1, Tuple2<Integer, String> o2) {
            return Integer.compare(o1._1(), o2._1());
        }
    }

    Broadcast<Map<Integer, String> > heroDict = sc.broadcast(new HrDict().getHeroDict());
    JavaRDD<String> lines = sc.textFile("/Users/11130/udemy/SparkCourse/Marvel-Graph.txt");

    JavaPairRDD<Integer, Integer> countOfOccurences = lines.mapToPair(
            s -> {
                String[] heroes = s.split(" ");
                return new Tuple2<>(Integer.parseInt(heroes[0]), heroes.length - 1);
            }
    ).reduceByKey(
            (x, y) -> x + y
    );

    JavaPairRDD<Integer, String> flippedCountOfOccurences = countOfOccurences.mapToPair(
            s -> new Tuple2<>(s._2(), heroDict.getValue().get(s._1()))
    );



    Tuple2<Integer, String> result = flippedCountOfOccurences.max(new DummyComparator());

    System.out.println("The most populat superhero is " + result._2() + " with " + result._1() + " number of occurences");

}}

`

错误堆栈跟踪:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1008)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
    at org.apache.spark.rdd.RDD$$anonfun$max$1.apply(RDD.scala:1396)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.max(RDD.scala:1395)
    at org.apache.spark.api.java.JavaRDDLike$class.max(JavaRDDLike.scala:602)
    at org.apache.spark.api.java.AbstractJavaRDDLike.max(JavaRDDLike.scala:46)
    at MostPopularSuperHero.main(MostPopularSuperHero.java:73)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.io.NotSerializableException: MostPopularSuperHero$1DummyComparator
Serialization stack:
    - object not serializable (class: MostPopularSuperHero$1DummyComparator, value: MostPopularSuperHero$1DummyComparator@72fb0cb3)
    - field (class: scala.math.LowPriorityOrderingImplicits$$anon$7, name: cmp$2, type: interface java.util.Comparator)
    - object (class scala.math.LowPriorityOrderingImplicits$$anon$7, scala.math.LowPriorityOrderingImplicits$$anon$7@4468fdae)
    - field (class: org.apache.spark.rdd.RDD$$anonfun$max$1, name: ord$10, type: interface scala.math.Ordering)
    - object (class org.apache.spark.rdd.RDD$$anonfun$max$1, <function0>)
    - field (class: org.apache.spark.rdd.RDD$$anonfun$max$1$$anonfun$apply$51, name: $outer, type: class org.apache.spark.rdd.RDD$$anonfun$max$1)
    - object (class org.apache.spark.rdd.RDD$$anonfun$max$1$$anonfun$apply$51, <function2>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 21 more

1 个答案:

答案 0 :(得分:1)

这是我使用的(基本上我们需要实现可序列化的)

class DummyComparator implements Serializable, Comparator<Tuple2<Integer, String> >{
            @Override
            public int compare(Tuple2<Integer, String> o1, Tuple2<Integer, String> o2) {
                return Integer.compare(o1._1(), o2._1());
            }
        }