Using the SparkContext's hadoop configuration inside RDD methods/closures, e.g. foreachPartition

Asked: 2016-07-06 12:33:41

Tags: java hadoop apache-spark rdd

I'm using Spark to read a bunch of files, process them, and then save all of them as Sequence files. What I wanted was one Sequence file per partition, so I did this:

SparkConf sparkConf = new SparkConf().setAppName("writingHDFS")
                .setMaster("local[2]")
                .set("spark.streaming.stopGracefullyOnShutdown", "true");
        final JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        jsc.hadoopConfiguration().addResource(hdfsConfPath + "hdfs-site.xml");
        jsc.hadoopConfiguration().addResource(hdfsConfPath + "core-site.xml");
        //JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5*1000));

        JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(sourcePath);
        if(!imageByteRDD.isEmpty())
            imageByteRDD.foreachPartition(new VoidFunction<Iterator<Tuple2<String,PortableDataStream>>>() {

                @Override
                public void call(Iterator<Tuple2<String, PortableDataStream>> arg0)
                        throws Exception {
                  [°°°SOME STUFF°°°]
                  SequenceFile.Writer writer = SequenceFile.createWriter(
                                     jsc.hadoopConfiguration(), 
//here lies the problem: how to pass the hadoopConfiguration I have put inside the Spark Context? 
Previously, I created a Configuration for each partition, and it works, but I'm sure there is a much more "sparky way"

Does anybody know how to use the Hadoop Configuration object inside RDD closures?

4 Answers:

Answer 0 (score: 14)

The problem here is that Hadoop Configurations aren't tagged as Serializable, so Spark won't pull them into RDDs. They are tagged as Writable, so Hadoop's serialization mechanism can marshal and unmarshal them, but Spark doesn't work with that directly.

The long-term fix options would be to:

  1. Add support for serializing Writables in Spark. Maybe SPARK-2421?
  2. Make the Hadoop Configuration serializable.
  3. Add explicit support for serializing Hadoop configurations.

You aren't going to hit any major objections to making the Hadoop conf serializable, provided you implement custom ser/deser methods that delegate to the Writable IO calls (and simply iterate through all key/value pairs). I say that as a Hadoop committer.

Update: here is the code for a serializable class that marshals the contents of a Hadoop Configuration. Create it with val ser = new ConfigSerDeser(hadoopConf); refer to it in your RDD as ser.get().

    /*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *    http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    
     import org.apache.hadoop.conf.Configuration
    
    /**
     * Class to make Hadoop configurations serializable; uses the
     * `Writable` operations to do this.
     * Note: this only serializes the explicitly set values, not any set
     * in site/default or other XML resources.
     * @param conf
     */
    class ConfigSerDeser(var conf: Configuration) extends Serializable {
    
      def this() {
        this(new Configuration())
      }
    
      def get(): Configuration = conf
    
      private def writeObject (out: java.io.ObjectOutputStream): Unit = {
        conf.write(out)
      }
    
      private def readObject (in: java.io.ObjectInputStream): Unit = {
        conf = new Configuration()
        conf.readFields(in)
      }
    
      private def readObjectNoData(): Unit = {
        conf = new Configuration()
      }
    }
    

Note that it would be relatively straightforward for someone to make this generic for all Writable classes; you would just need to provide a classname in the constructor and use that to instantiate the Writable during deserialization.
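
For illustration, here is a minimal Java sketch of that generic idea; the class name SerializableWritableWrapper and the details are hypothetical, not taken from the answer above. The concrete class name is written out during serialization and used to re-instantiate the Writable on deserialization:

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    import org.apache.hadoop.io.Writable;

    // Hypothetical generic wrapper: stores the Writable's class name plus its
    // Writable-encoded bytes, and rebuilds the instance on deserialization.
    public class SerializableWritableWrapper<T extends Writable> implements Serializable {
        private transient T writable;

        public SerializableWritableWrapper(T writable) {
            this.writable = writable;
        }

        public T get() {
            return this.writable;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.writeUTF(this.writable.getClass().getName());
            this.writable.write(out);
        }

        @SuppressWarnings("unchecked")
        private void readObject(ObjectInputStream in) throws IOException {
            try {
                String className = in.readUTF();
                this.writable = (T) Class.forName(className).getDeclaredConstructor().newInstance();
                this.writable.readFields(in);
            } catch (ReflectiveOperationException e) {
                throw new IOException("Could not recreate Writable from stream", e);
            }
        }
    }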

Answer 1 (score: 4)

You can serialize and deserialize the org.apache.hadoop.conf.Configuration using org.apache.spark.SerializableWritable.

For example:

import org.apache.spark.SerializableWritable

...

val hadoopConf = spark.sparkContext.hadoopConfiguration
// serialize here
val serializedConf = new SerializableWritable(hadoopConf)


// then access the conf by calling .value on serializedConf
rdd.map(someFunction(serializedConf.value))
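
Since the question's code is Java, an equivalent (untested) sketch using the Java API might look like this, reusing the question's jsc and imageByteRDD variables:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SerializableWritable;

    // Wrap the driver-side configuration so that Java serialization can ship it
    // into the closure along with the task.
    final SerializableWritable<Configuration> wrappedConf =
            new SerializableWritable<>(jsc.hadoopConfiguration());

    imageByteRDD.foreachPartition(tuples -> {
        // value() returns the deserialized Configuration on the executor side
        Configuration conf = wrappedConf.value();
        // ... build the SequenceFile.Writer from conf here ...
    });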

Answer 2 (score: 3)

Here is a Java implementation, based on @Steve's answer.

import java.io.Serializable;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;


public class SerializableHadoopConfiguration implements Serializable {
    Configuration conf;

    public SerializableHadoopConfiguration(Configuration hadoopConf) {
        this.conf = hadoopConf;

        if (this.conf == null) {
            this.conf = new Configuration();
        }
    }

    public SerializableHadoopConfiguration() {
        this.conf = new Configuration();
    }

    public Configuration get() {
        return this.conf;
    }

    private void writeObject(java.io.ObjectOutputStream out) throws IOException {
        this.conf.write(out);
    }

    private void readObject(java.io.ObjectInputStream in) throws IOException {
        this.conf = new Configuration();
        this.conf.readFields(in);
    }
}
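
For completeness, a usage sketch that ties this wrapper back to the original question (one Sequence file per partition) might look like the code below; the output path, key/value classes, and variable names are illustrative assumptions, not part of this answer:

    import java.util.UUID;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.input.PortableDataStream;

    import scala.Tuple2;

    // Driver side: wrap the SparkContext's Hadoop configuration once.
    final SerializableHadoopConfiguration serConf =
            new SerializableHadoopConfiguration(jsc.hadoopConfiguration());

    imageByteRDD.foreachPartition(tuples -> {
        // Executor side: get() returns the deserialized Configuration.
        Configuration conf = serConf.get();
        // Placeholder output path; pick a real per-partition destination.
        Path outFile = new Path("/tmp/part-" + UUID.randomUUID());
        try (SequenceFile.Writer writer = SequenceFile.createWriter(
                conf,
                SequenceFile.Writer.file(outFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            while (tuples.hasNext()) {
                Tuple2<String, PortableDataStream> tuple = tuples.next();
                writer.append(new Text(tuple._1()), new BytesWritable(tuple._2().toArray()));
            }
        }
    });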

Answer 3 (score: 1)

It looks like it cannot be done, so here is the code I used:

final String hdfsNameNodePath = "hdfs://quickstart.cloudera:8080";

JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(sourcePath);
        if(!imageByteRDD.isEmpty())
            imageByteRDD.foreachPartition(new VoidFunction<Iterator<Tuple2<String,PortableDataStream>>>() {

                @Override
                public void call(Iterator<Tuple2<String, PortableDataStream>> arg0)
                        throws Exception {

                    Configuration conf = new Configuration();
                    conf.set("fs.defaultFS", hdfsNameNodePath);
                    //the string above should be passed as an argument
                    SequenceFile.Writer writer = SequenceFile.createWriter(
                                     conf, 
                                     SequenceFile.Writer.file([***ETCETERA...