Getting file details using Spark

Time: 2016-06-30 09:02:25

Tags: java apache-spark spark-streaming

I load files in Spark with these built-in methods:

JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(SOURCE_PATH);

JavaPairRDD<String, String> miao = jsc.wholeTextFiles(SOURCE_PATH);

I now have a byte or string representation of each file pulled from the folder, stored in the value of the PairRDD; the key holds the file name. How can I get details about these files? Something like:

File miao = new File(path);
// this kind of detail
long lastModified = miao.lastModified();

Should I convert them back into File objects, read them, and then turn them into another byte array? Is there a faster way?

2 Answers:

Answer 0 (score: 1):

You can write a custom input format and pass that inputFormatClass to the newAPIHadoopFile method on SparkContext. This inputFormat uses a custom RecordReader, and that record reader reads the file content along with the other file-related information (i.e. author, modifiedDate, etc.). You also need a custom Writable class to hold the file information and the file content read by the record reader.

The complete working code is given below. It uses a custom input format class called RichFileInputFormat. RichFileInputFormat is a whole-file input format, meaning there is exactly one split per input file. This in turn means the number of RDD partitions equals the number of input files: if your input path contains 10 files, the resulting RDD will have 10 partitions, regardless of the file sizes (see the quick check after the snippet below).

This is how you call this custom inputFormat from the SparkContext to load the files:

JavaPairRDD<Text, FileInfoWritable> rdd = sc.newAPIHadoopFile(args[1], RichFileInputFormat.class, Text.class, FileInfoWritable.class, new Configuration());
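
As a quick check of the one-partition-per-file behaviour mentioned above (a minimal sketch, not a standalone program; `rdd` is the pair RDD created just above):

// With isSplitable() returning false, every input file becomes exactly one split,
// so the partition count should equal the number of input files.
System.out.println("number of partitions = " + rdd.partitions().size());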

So your RDD keys will be the file paths and the values will be FileInfoWritable objects, which hold both the file content and the other file-related information.
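
For example, the metadata can be read from the value like this (a minimal sketch reusing the `rdd` from the snippet above; it assumes the FileInfoWritable class shown further down is on the classpath, with imports as in the main class below):

// Print the metadata carried by each FileInfoWritable value.
rdd.foreach(new VoidFunction<Tuple2<Text, FileInfoWritable>>() {
    public void call(Tuple2<Text, FileInfoWritable> t) throws Exception {
        FileInfoWritable info = t._2();
        System.out.println(t._1() + " owner=" + info.getOwner()
                + " createdDate=" + info.getCreatedDate() + " path=" + info.getPath());
    }
});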

The complete working code is pasted below:

  1. Custom InputFormat class

           package nk.stackoverflow.spark;
    
           import java.io.IOException;
    
           import org.apache.hadoop.fs.Path;
           import org.apache.hadoop.io.Text;
           import org.apache.hadoop.mapreduce.InputSplit;
           import org.apache.hadoop.mapreduce.JobContext;
           import org.apache.hadoop.mapreduce.RecordReader;
           import org.apache.hadoop.mapreduce.TaskAttemptContext;
           import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    
           public class RichFileInputFormat extends FileInputFormat<Text, FileInfoWritable> {
    
            @Override
            public RecordReader<Text, FileInfoWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
    
                return new RichFileRecordReader();
            }
    
            @Override
            protected boolean isSplitable(JobContext context, Path filename) {
                // Never split a file: each input file is read as a whole by one record reader.
                return false;
            }
           }
    
    2. RecordReader class

      package nk.stackoverflow.spark;

      import java.io.IOException;

      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptContext;
      import org.apache.hadoop.mapreduce.lib.input.FileSplit;
      import org.apache.spark.deploy.SparkHadoopUtil;

      public class RichFileRecordReader extends RecordReader<Text, FileInfoWritable> {
          private String author;
          private String createdDate;
          private String owner;
          private String lastModified;
          private String content;
          private boolean processed;

          private Text key;
          private Path path;
          private FileSystem fs;

          public RichFileRecordReader() {
          }

          @Override
          public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
              final FileSplit fileSplit = (FileSplit) split;
              final Path path = fileSplit.getPath();
              this.fs = path.getFileSystem(SparkHadoopUtil.get().getConfigurationFromJobContext(context));
              // FileStatus exposes the file metadata (owner, modification/access time, length, ...).
              final FileStatus stat = this.fs.getFileStatus(path);
              this.path = path;
              this.author = stat.getOwner();
              this.owner = stat.getOwner();
              this.createdDate = String.valueOf(stat.getModificationTime());
              this.lastModified = String.valueOf(stat.getAccessTime());
              this.key = new Text(path.toString());
          }

          @Override
          public boolean nextKeyValue() throws IOException, InterruptedException {
              // Whole-file reader: emit exactly one key/value pair per file.
              FSDataInputStream stream = null;
              try {
                  if (!processed) {
                      int len = (int) this.fs.getFileStatus(this.path).getLen();
                      final byte[] data = new byte[len];

                      stream = this.fs.open(this.path);
                      int read = stream.read(data);
                      this.content = new String(data, 0, read);
                      processed = true;
                      return true;
                  }
              } catch (IOException e) {
                  e.printStackTrace();
              } finally {
                  if (stream != null) {
                      try {
                          stream.close();
                      } catch (IOException ie) {
                          ie.printStackTrace();
                      }
                  }
              }
              return false;
          }

          @Override
          public Text getCurrentKey() throws IOException, InterruptedException {
              return this.key;
          }

          @Override
          public FileInfoWritable getCurrentValue() throws IOException, InterruptedException {
              final FileInfoWritable fileInfo = new FileInfoWritable();
              fileInfo.setContent(this.content);
              fileInfo.setAuthor(this.author);
              fileInfo.setCreatedDate(this.createdDate);
              fileInfo.setOwner(this.owner);
              fileInfo.setPath(this.path.toString());
              return fileInfo;
          }

          @Override
          public float getProgress() throws IOException, InterruptedException {
              return processed ? 1.0f : 0.0f;
          }

          @Override
          public void close() throws IOException {
          }
      }
      
      3. Writable class

            package nk.stackoverflow.spark;

            import java.io.DataInput;
            import java.io.DataOutput;
            import java.io.IOException;
            import java.nio.charset.Charset;
        
            import org.apache.hadoop.io.Writable;
        
            import com.google.common.base.Charsets;
        
            public class FileInfoWritable implements Writable {
                private final static Charset CHARSET = Charsets.UTF_8;
                private String createdDate;
                private String owner;
            //  private String lastModified;
                private String content;
                private String path;
                public FileInfoWritable() {
        
                }
        
                public void readFields(DataInput in) throws IOException {
                    this.createdDate = readString(in);
                    this.owner = readString(in);
            //      this.lastModified = readString(in);
                    this.content = readString(in);
                    this.path = readString(in);
                }
        
                public void write(DataOutput out) throws IOException {
                    writeString(createdDate, out);
                    writeString(owner, out);
            //      writeString(lastModified, out);
                    writeString(content, out);
                    writeString(path, out);
                }
        
                private String readString(DataInput in) throws IOException {
                    final int n = in.readInt();
                    final byte[] content = new byte[n];
                    in.readFully(content);
                    return new String(content, CHARSET);
                }
        
                private void writeString(String str, DataOutput out) throws IOException {
                    // Write the byte length (not the character count) so readString reads the right number of bytes.
                    final byte[] bytes = str.getBytes(CHARSET);
                    out.writeInt(bytes.length);
                    out.write(bytes);
                }
        
                public String getCreatedDate() {
                    return createdDate;
                }
        
                public void setCreatedDate(String createdDate) {
                    this.createdDate = createdDate;
                }
        
                public String getAuthor() {
                    return owner;
                }
        
                public void setAuthor(String author) {
                    this.owner = author;
                }
        
                /*public String getLastModified() {
                    return lastModified;
                }*/
        
                /*public void setLastModified(String lastModified) {
                    this.lastModified = lastModified;
                }*/
        
                public String getOwner() {
                    return owner;
                }
        
                public void setOwner(String owner) {
                    this.owner = owner;
                }
        
                public String getContent() {
                    return content;
                }
        
                public void setContent(String content) {
                    this.content = content;
                }
        
                public String getPath() {
                    return path;
                }
        
                public void setPath(String path) {
                    this.path = path;
                }
        
        
            }
        
        4. Main class showing how to use it

          package nk.stackoverflow.spark;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.io.Text;
          import org.apache.spark.SparkConf;
          import org.apache.spark.api.java.JavaPairRDD;
          import org.apache.spark.api.java.JavaSparkContext;
          import org.apache.spark.api.java.function.VoidFunction;

          import scala.Tuple2;

          public class CustomInputFormat {
              public static void main(String[] args) {
                  SparkConf conf = new SparkConf();

                  conf.setAppName(args[0]);
                  conf.setMaster("local[*]");
                  final String inputPath = args[1];
                  JavaSparkContext sc = new JavaSparkContext(conf);

                  JavaPairRDD<Text, FileInfoWritable> rdd = sc.newAPIHadoopFile(inputPath, RichFileInputFormat.class,
                          Text.class, FileInfoWritable.class, new Configuration());

                  rdd.foreach(new VoidFunction<Tuple2<Text, FileInfoWritable>>() {

                      public void call(Tuple2<Text, FileInfoWritable> t) throws Exception {
                          final Text filePath = t._1();
                          final String fileContent = t._2().getContent();
                          System.out.println("file " + filePath + " has contents= " + fileContent);
                      }
                  });

                  sc.close();
              }
          }


Answer 1 (score: 0):

Parse this RDD with a map transformation. Inside the map function, call a function that takes a String (the file name) and uses that String to open and process the file. So it is nothing more than a map transformation over the RDD that invokes a function for each row.
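
For illustration, a minimal sketch of that idea (it assumes the keys produced by wholeTextFiles are paths reachable through the Hadoop FileSystem API, and reuses the jsc and SOURCE_PATH names from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

// ...

JavaPairRDD<String, String> files = jsc.wholeTextFiles(SOURCE_PATH);
JavaRDD<String> details = files.map(new Function<Tuple2<String, String>, String>() {
    public String call(Tuple2<String, String> t) throws Exception {
        Path p = new Path(t._1());                            // key = file name/path
        FileSystem fs = p.getFileSystem(new Configuration()); // file system for that path
        FileStatus stat = fs.getFileStatus(p);                // fetch the file's metadata
        return t._1() + " modified=" + stat.getModificationTime() + " owner=" + stat.getOwner();
    }
});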