Getting file details using Spark

Time: 2016-06-30 09:02:25

Tags: java apache-spark spark-streaming

I load files in Spark with these built-in methods:

JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(SOURCE_PATH);

JavaPairRDD<String, String> miao = jsc.wholeTextFiles(SOURCE_PATH);

I now have a byte or string representation of each file pulled from the folder, stored in the value of the PairRDD; the key holds the file name. How can I get details about these files? Something like:

File miao = new File(path);
// this kind of detail
long lastModified = miao.lastModified();

Should I convert them back into File objects, read them, and then turn them into another byte array? Is there a faster way?

2 Answers:

Answer 0 (score: 1):

You can write a custom input format and pass that inputFormatClass to the newAPIHadoopFile method on SparkContext. This inputFormat uses a custom RecordReader, and that record reader reads the file content along with the other file-related information (i.e. author, modifiedDate, etc.). You also need a custom Writable class to hold the file information and the file content read by the record reader.

The complete working code is given below. It uses a custom input format class called RichFileInputFormat. RichFileInputFormat is a whole-file input format, meaning there is exactly one split per input file. This in turn means the number of RDD partitions equals the number of input files: if your input path contains 10 files, the resulting RDD will have 10 partitions, regardless of the file sizes (see the quick check after the snippet below).

This is how you call this custom inputFormat from the SparkContext to load the files:

JavaPairRDD<Text, FileInfoWritable> rdd = sc.newAPIHadoopFile(args[1], RichFileInputFormat.class, Text.class, FileInfoWritable.class, new Configuration());
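
As a quick check of the one-partition-per-file behaviour mentioned above (a minimal sketch, not a standalone program; `rdd` is the pair RDD created just above):

// With isSplitable() returning false, every input file becomes exactly one split,
// so the partition count should equal the number of input files.
System.out.println("number of partitions = " + rdd.partitions().size());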

So your RDD keys will be the file paths and the values will be FileInfoWritable objects, which hold both the file content and the other file-related information.
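
For example, the metadata can be read from the value like this (a minimal sketch reusing the `rdd` from the snippet above; it assumes the FileInfoWritable class shown further down is on the classpath, with imports as in the main class below):

// Print the metadata carried by each FileInfoWritable value.
rdd.foreach(new VoidFunction<Tuple2<Text, FileInfoWritable>>() {
    public void call(Tuple2<Text, FileInfoWritable> t) throws Exception {
        FileInfoWritable info = t._2();
        System.out.println(t._1() + " owner=" + info.getOwner()
                + " createdDate=" + info.getCreatedDate() + " path=" + info.getPath());
    }
});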

The complete working code is pasted below:

  1. Custom InputFormat class

           package nk.stackoverflow.spark;
    
           import java.io.IOException;
    
           import org.apache.hadoop.fs.Path;
           import org.apache.hadoop.io.Text;
           import org.apache.hadoop.mapreduce.InputSplit;
           import org.apache.hadoop.mapreduce.JobContext;
           import org.apache.hadoop.mapreduce.RecordReader;
           import org.apache.hadoop.mapreduce.TaskAttemptContext;
           import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    
           public class RichFileInputFormat extends FileInputFormat<Text, FileInfoWritable> {
    
            @Override
            public RecordReader<Text, FileInfoWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
    
                return new RichFileRecordReader();
            }
    
            @Override
            protected boolean isSplitable(JobContext context, Path filename) {
                // Never split a file: each input file is read as a whole by one record reader.
                return false;
            }
           }
    
    2. RecordReader class

      package nk.stackoverflow.spark;

      import java.io.IOException;

      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptContext;
      import org.apache.hadoop.mapreduce.lib.input.FileSplit;
      import org.apache.spark.deploy.SparkHadoopUtil;

      public class RichFileRecordReader extends RecordReader<Text, FileInfoWritable> {
          private String author;
          private String createdDate;
          private String owner;
          private String lastModified;
          private String content;
          private boolean processed;

          private Text key;
          private Path path;
          private FileSystem fs;

          public RichFileRecordReader() {
          }

          @Override
          public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
              final FileSplit fileSplit = (FileSplit) split;
              final Path path = fileSplit.getPath();
              this.fs = path.getFileSystem(SparkHadoopUtil.get().getConfigurationFromJobContext(context));
              // FileStatus exposes the file metadata (owner, modification/access time, length, ...).
              final FileStatus stat = this.fs.getFileStatus(path);
              this.path = path;
              this.author = stat.getOwner();
              this.owner = stat.getOwner();
              this.createdDate = String.valueOf(stat.getModificationTime());
              this.lastModified = String.valueOf(stat.getAccessTime());
              this.key = new Text(path.toString());
          }

          @Override
          public boolean nextKeyValue() throws IOException, InterruptedException {
              // Whole-file reader: emit exactly one key/value pair per file.
              FSDataInputStream stream = null;
              try {
                  if (!processed) {
                      int len = (int) this.fs.getFileStatus(this.path).getLen();
                      final byte[] data = new byte[len];

                      stream = this.fs.open(this.path);
                      int read = stream.read(data);
                      this.content = new String(data, 0, read);
                      processed = true;
                      return true;
                  }
              } catch (IOException e) {
                  e.printStackTrace();
              } finally {
                  if (stream != null) {
                      try {
                          stream.close();
                      } catch (IOException ie) {
                          ie.printStackTrace();
                      }
                  }
              }
              return false;
          }

          @Override
          public Text getCurrentKey() throws IOException, InterruptedException {
              return this.key;
          }

          @Override
          public FileInfoWritable getCurrentValue() throws IOException, InterruptedException {
              final FileInfoWritable fileInfo = new FileInfoWritable();
              fileInfo.setContent(this.content);
              fileInfo.setAuthor(this.author);
              fileInfo.setCreatedDate(this.createdDate);
              fileInfo.setOwner(this.owner);
              fileInfo.setPath(this.path.toString());
              return fileInfo;
          }

          @Override
          public float getProgress() throws IOException, InterruptedException {
              return processed ? 1.0f : 0.0f;
          }

          @Override
          public void close() throws IOException {
          }
      }
      
      3. Writable class

            package nk.stackoverflow.spark;

            import java.io.DataInput;
            import java.io.DataOutput;
            import java.io.IOException;
            import java.nio.charset.Charset;
        
            import org.apache.hadoop.io.Writable;
        
            import com.google.common.base.Charsets;
        
            public class FileInfoWritable implements Writable {
                private final static Charset CHARSET = Charsets.UTF_8;
                private String createdDate;
                private String owner;
            //  private String lastModified;
                private String content;
                private String path;
                public FileInfoWritable() {
        
                }
        
                public void readFields(DataInput in) throws IOException {
                    this.createdDate = readString(in);
                    this.owner = readString(in);
            //      this.lastModified = readString(in);
                    this.content = readString(in);
                    this.path = readString(in);
                }
        
                public void write(DataOutput out) throws IOException {
                    writeString(createdDate, out);
                    writeString(owner, out);
            //      writeString(lastModified, out);
                    writeString(content, out);
                    writeString(path, out);
                }
        
                private String readString(DataInput in) throws IOException {
                    final int n = in.readInt();
                    final byte[] content = new byte[n];
                    in.readFully(content);
                    return new String(content, CHARSET);
                }
        
                private void writeString(String str, DataOutput out) throws IOException {
                    // Write the byte length (not the character count) so readString reads the right number of bytes.
                    final byte[] bytes = str.getBytes(CHARSET);
                    out.writeInt(bytes.length);
                    out.write(bytes);
                }
        
                public String getCreatedDate() {
                    return createdDate;
                }
        
                public void setCreatedDate(String createdDate) {
                    this.createdDate = createdDate;
                }
        
                public String getAuthor() {
                    return owner;
                }
        
                public void setAuthor(String author) {
                    this.owner = author;
                }
        
                /*public String getLastModified() {
                    return lastModified;
                }*/
        
                /*public void setLastModified(String lastModified) {
                    this.lastModified = lastModified;
                }*/
        
                public String getOwner() {
                    return owner;
                }
        
                public void setOwner(String owner) {
                    this.owner = owner;
                }
        
                public String getContent() {
                    return content;
                }
        
                public void setContent(String content) {
                    this.content = content;
                }
        
                public String getPath() {
                    return path;
                }
        
                public void setPath(String path) {
                    this.path = path;
                }
        
        
            }
        
        4. Main class showing how to use it

          package nk.stackoverflow.spark;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.io.Text;
          import org.apache.spark.SparkConf;
          import org.apache.spark.api.java.JavaPairRDD;
          import org.apache.spark.api.java.JavaSparkContext;
          import org.apache.spark.api.java.function.VoidFunction;

          import scala.Tuple2;

          public class CustomInputFormat {
              public static void main(String[] args) {
                  SparkConf conf = new SparkConf();

                  conf.setAppName(args[0]);
                  conf.setMaster("local[*]");
                  final String inputPath = args[1];
                  JavaSparkContext sc = new JavaSparkContext(conf);

                  JavaPairRDD<Text, FileInfoWritable> rdd = sc.newAPIHadoopFile(inputPath, RichFileInputFormat.class,
                          Text.class, FileInfoWritable.class, new Configuration());

                  rdd.foreach(new VoidFunction<Tuple2<Text, FileInfoWritable>>() {

                      public void call(Tuple2<Text, FileInfoWritable> t) throws Exception {
                          final Text filePath = t._1();
                          final String fileContent = t._2().getContent();
                          System.out.println("file " + filePath + " has contents= " + fileContent);
                      }
                  });

                  sc.close();
              }
          }


Answer 1 (score: 0):

Parse this RDD with a map transformation. Inside the map function, call a function that takes a String (the file name) and uses that String to open and process the file. So it is nothing more than a map transformation over the RDD that invokes a function for each row.
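
For illustration, a minimal sketch of that idea (it assumes the keys produced by wholeTextFiles are paths reachable through the Hadoop FileSystem API, and reuses the jsc and SOURCE_PATH names from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

// ...

JavaPairRDD<String, String> files = jsc.wholeTextFiles(SOURCE_PATH);
JavaRDD<String> details = files.map(new Function<Tuple2<String, String>, String>() {
    public String call(Tuple2<String, String> t) throws Exception {
        Path p = new Path(t._1());                            // key = file name/path
        FileSystem fs = p.getFileSystem(new Configuration()); // file system for that path
        FileStatus stat = fs.getFileStatus(p);                // fetch the file's metadata
        return t._1() + " modified=" + stat.getModificationTime() + " owner=" + stat.getOwner();
    }
});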