Question

I want to use Kryo serialization in a Spark job.

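package com.test;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class SerializeTest {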

    public static class Toto implements Serializable {
        private static final long serialVersionUID = 6369241181075151871L;
        private String a;

        public String getA() {
            return a;
        }

        public void setA(String a) {
            this.a = a;
        }
    }

    private static final PairFunction<Toto, Toto, Integer> WRITABLE_CONVERTOR = new PairFunction<Toto, Toto, Integer>() {
        private static final long serialVersionUID = -7119334882912691587L;

        @Override
        public Tuple2<Toto, Integer> call(Toto input) throws Exception {
            return new Tuple2<Toto, Integer>(input, 1);
        }
    };

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SerializeTest");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        conf.registerKryoClasses(new Class<?>[]{Toto[].class});
        JavaSparkContext context = new JavaSparkContext(conf);

        List<Toto> list = new ArrayList<Toto>();
        list.add(new Toto());
        JavaRDD<Toto> cursor = context.parallelize(list, list.size());

        JavaPairRDD<Toto, Integer> writable = cursor.mapToPair(WRITABLE_CONVERTOR);
        writable.saveAsHadoopFile(args[0], Toto.class, Integer.class, SequenceFileOutputFormat.class);

        context.close();
    }

}

But I get this error:

java.io.IOException: Could not find a serializer for the Key class: 'com.test.SerializeTest.Toto'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
    at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1179)
    at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1094)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:273)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:530)
    at org.apache.hadoop.mapred.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:63)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/09/21 17:49:14 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.IOException: Could not find a serializer for the Key class: 'com.test.SerializeTest.Toto'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
    at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1179)
    at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1094)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:273)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:530)
    at org.apache.hadoop.mapred.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:63)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Thanks.

Answer 1

This error is related to neither Spark nor Kryo.

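When you use a Hadoop output format, you need to make sure that your keys and values are instances of Writable. Hadoop does not use Java serialization by default (and you would not want it to, because it is very inefficient).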

You can check the io.serializations property in your configuration to see the list of serializers in use, which includes org.apache.hadoop.io.serializer.WritableSerialization.

To fix this, your Toto class must implement Writable. The same problem applies to Integer: use IntWritable instead.
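As a rough illustration of what implementing Writable involves, here is a minimal sketch of Toto with the two methods the interface requires. So that it compiles without Hadoop on the classpath, it declares a local stand-in interface with the same two method signatures as org.apache.hadoop.io.Writable; in a real job you would delete the stand-in and import Hadoop's interface. The WritableRoundTrip helper and the null-flag encoding are assumptions made for the example, not something from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in with the same two methods as org.apache.hadoop.io.Writable, so the
// sketch compiles without Hadoop. In a real job, remove this interface and
// import org.apache.hadoop.io.Writable instead.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

class Toto implements Writable {
    private String a;

    // Writables are created reflectively and then filled by readFields,
    // so a no-argument constructor is needed.
    Toto() {}

    String getA() { return a; }
    void setA(String a) { this.a = a; }

    @Override
    public void write(DataOutput out) throws IOException {
        // writeUTF throws on null, so write a presence flag first.
        // (The question's "new Toto()" leaves the field null.)
        out.writeBoolean(a != null);
        if (a != null) {
            out.writeUTF(a);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Read the fields back in exactly the order write() emitted them.
        a = in.readBoolean() ? in.readUTF() : null;
    }
}

// Hypothetical helper: serializes a Toto to bytes and reads it into a new instance.
class WritableRoundTrip {
    static Toto roundTrip(Toto original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }
            Toto copy = new Toto();
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Toto t = new Toto();
        t.setA("hello");
        System.out.println(roundTrip(t).getA()); // prints "hello"
    }
}
```

In the job from the question, the pair function would then emit IntWritable values, and saveAsHadoopFile would be passed IntWritable.class in place of Integer.class.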

Kryo serialization error in a Spark job
