由于不可序列化的对象,Spark作业失败

时间:2017-06-21 02:24:07

标签: java hadoop apache-spark hbase hfile

我正在运行一个Spark作业,为HFiles数据存储生成HBase

以前我的Cloudera群集工作正常,但是当我们切换到EMR群集时,它会失败并显示以下堆栈跟踪:

Serialization stack:
    - object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 50 31 36 31 32 37 30 33 34 5f 49 36 35 38 34 31 35 38 35); not retrying


Serialization stack:
    - object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 50 31 36 31 32 37 30 33 34 5f 49 36 35 38 34 31 35 38 35)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1493)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1492)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1492)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1720)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:629)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1158)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1005)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
    at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:823)

我的问题:

  1. 什么可能导致两次运行之间的差异?两个集群之间的版本差异?
  2. 我做了研究并找到了这个post:然后我将Kyro参数添加到我的spark-submit命令中,现在我的命令如下所示: spark-submit --conf spark.kryo.classesToRegister=org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.KeyValue --master yarn --deploy-mode client --driver-memory 16G --executor-memory 18G ...但我仍然遇到同样的错误。
  3. 这是我的Java代码:

    protected void generateHFilesUsingSpark(JavaRDD<Row> rdd) throws Exception {
            JavaPairRDD<ImmutableBytesWritable, KeyValue> javaPairRdd = rdd.mapToPair(
                new PairFunction<Row, ImmutableBytesWritable, KeyValue>() {
                    public Tuple2<ImmutableBytesWritable, KeyValue> call(Row row) throws Exception {
                        String key = (String) row.get(0);
                        String value = (String) row.get(1);
    
                        ImmutableBytesWritable rowKey = new ImmutableBytesWritable();
                        byte[] rowKeyBytes = Bytes.toBytes(key);
                        rowKey.set(rowKeyBytes);
    
                        KeyValue keyValue = new KeyValue(rowKeyBytes,
                            Bytes.toBytes("COL"),
                            Bytes.toBytes("FM"),
                            ProductJoin.newBuilder()
                                .setId(key)
                                .setSolrJson(value)
                                .build().toByteArray());
    
                        return new Tuple2<ImmutableBytesWritable, KeyValue>(rowKey, keyValue);
                    }
                });
    
            Configuration baseConf = HBaseConfiguration.create();
            Configuration conf = new Configuration();
            conf.set(HBASE_ZOOKEEPER_QUORUM, "xxx.xxx.xx.xx");
            Job job = new Job(baseConf, "APP-NAME");
            HTable table = new HTable(conf, "hbaseTargetTable");
            Partitioner partitioner = new IntPartitioner(importerParams.shards);
            JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
                javaPairRdd.repartitionAndSortWithinPartitions(partitioner);
            HFileOutputFormat2.configureIncrementalLoad(job, table);
            System.out.println("Done configuring incremental load....");
    
            Configuration config = job.getConfiguration();
    
            repartitionedRdd.saveAsNewAPIHadoopFile(
                "hfilesOutputPath",
                ImmutableBytesWritable.class,
                KeyValue.class,
                HFileOutputFormat2.class,
                config
            );
            System.out.println("Saved to HFiles to: " + importerParams.hfilesOutputPath);
    }
    

1 个答案:

答案 0 :(得分:0)

好了,问题解决了,诀窍是使用KyroSerializer,我在我的Java代码中添加了这个来注册ImmutableBytesWritable。

        SparkSession.Builder builder = SparkSession.builder().appName("AWESOME");
        builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        SparkConf conf = new SparkConf().setAppName("AWESOME");
        Class<?>[] classes = new Class[]{org.apache.hadoop.hbase.io.ImmutableBytesWritable.class};
        conf.registerKryoClasses(classes);
        builder.config(conf);
        SparkSession spark = builder.getOrCreate();