I load files in Spark using these built-in methods:
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(SOURCE_PATH);
or
JavaPairRDD<String, String> miao = jsc.wholeTextFiles(SOURCE_PATH);
I end up with a byte or String representation of each file pulled from the folder, stored in the value of the PairRDD; the key holds the file name. How can I get details about these files, something like:
File miao = new File(path);
// this kind of details
long date = miao.lastModified();
Should I convert them back into File objects, read them again and then turn them into another byte array, or is there a faster way?
Answer 0 (score: 1)
You can write a custom input format and pass that input format class to the newAPIHadoopFile method on SparkContext. This input format uses a custom RecordReader, and the custom RecordReader reads the file content together with other file-related information (owner, modification date, and so on). You also need a custom Writable class that holds both the file information and the file content read by the record reader.
The complete working code is given below. It uses a custom input format class called RichFileInputFormat. RichFileInputFormat is a whole-file input format, meaning each input file produces exactly one split. As a consequence, the number of RDD partitions equals the number of input files: if your input path contains 10 files, the resulting RDD will have 10 partitions regardless of how large the files are.
This is how you call this custom input format from the SparkContext to load the files:
JavaPairRDD<Text, FileInfoWritable> rdd = sc.newAPIHadoopFile(args[1], RichFileInputFormat.class, Text.class,FileInfoWritable.class, new Configuration());
So your RDD key will be the file path, and the value will be a FileInfoWritable that holds both the file content and the other file-related information.
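As a quick sanity check (a minimal sketch against the rdd created above; on Spark releases where JavaPairRDD does not expose getNumPartitions(), rdd.partitions().size() gives the same number), the partition count should match the number of input files:
// One whole-file split per input file, so this prints the number of files under the input path
System.out.println("partitions = " + rdd.getNumPartitions());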
The complete working code is pasted below.
The custom input format class:
package nk.stackoverflow.spark;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class RichFileInputFormat extends FileInputFormat<Text, FileInfoWritable> {

    @Override
    public RecordReader<Text, FileInfoWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new RichFileRecordReader();
    }

    // Never split a file: each input file becomes exactly one split, and hence one RDD partition.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
package nk.stackoverflow.spark;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.spark.deploy.SparkHadoopUtil;

public class RichFileRecordReader extends RecordReader<Text, FileInfoWritable> {

    private String author;
    private String createdDate;
    private String owner;
    private String lastModified;
    private String content;
    private boolean processed;

    private Text key;
    private Path path;
    private FileSystem fs;

    public RichFileRecordReader() {
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        final FileSplit fileSplit = (FileSplit) split;
        final Path path = fileSplit.getPath();
        this.fs = path.getFileSystem(SparkHadoopUtil.get().getConfigurationFromJobContext(context));

        // FileStatus exposes no creation time, so the modification time is stored as createdDate
        // and the access time as lastModified.
        final FileStatus stat = this.fs.getFileStatus(path);
        this.path = path;
        this.author = stat.getOwner();
        this.owner = stat.getOwner();
        this.createdDate = String.valueOf(stat.getModificationTime());
        this.lastModified = String.valueOf(stat.getAccessTime());
        this.key = new Text(path.toString());
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // Whole-file reader: emit a single key/value pair per split, then report that the split is done.
        FSDataInputStream stream = null;
        try {
            if (!processed) {
                final int len = (int) this.fs.getFileStatus(this.path).getLen();
                final byte[] data = new byte[len];
                stream = this.fs.open(this.path);
                stream.readFully(data);
                this.content = new String(data, 0, len);
                processed = true;
                return true;
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (stream != null) {
                try {
                    stream.close();
                } catch (IOException ie) {
                    ie.printStackTrace();
                }
            }
        }
        return false;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return this.key;
    }

    @Override
    public FileInfoWritable getCurrentValue() throws IOException, InterruptedException {
        final FileInfoWritable fileInfo = new FileInfoWritable();
        fileInfo.setContent(this.content);
        fileInfo.setAuthor(this.author);
        fileInfo.setCreatedDate(this.createdDate);
        fileInfo.setOwner(this.owner);
        fileInfo.setPath(this.path.toString());
        return fileInfo;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
    }
}
package nk.stackoverflow.spark;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.Charset;

import org.apache.hadoop.io.Writable;

import com.google.common.base.Charsets;

public class FileInfoWritable implements Writable {

    private final static Charset CHARSET = Charsets.UTF_8;

    private String createdDate;
    private String owner;
    // private String lastModified;
    private String content;
    private String path;

    public FileInfoWritable() {
    }

    public void readFields(DataInput in) throws IOException {
        this.createdDate = readString(in);
        this.owner = readString(in);
        // this.lastModified = readString(in);
        this.content = readString(in);
        this.path = readString(in);
    }

    public void write(DataOutput out) throws IOException {
        writeString(createdDate, out);
        writeString(owner, out);
        // writeString(lastModified, out);
        writeString(content, out);
        writeString(path, out);
    }

    private String readString(DataInput in) throws IOException {
        final int n = in.readInt();
        final byte[] content = new byte[n];
        in.readFully(content);
        return new String(content, CHARSET);
    }

    private void writeString(String str, DataOutput out) throws IOException {
        // Write the UTF-8 byte length (not the character count) so readString reads the right number of bytes.
        final byte[] bytes = str.getBytes(CHARSET);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    public String getCreatedDate() {
        return createdDate;
    }

    public void setCreatedDate(String createdDate) {
        this.createdDate = createdDate;
    }

    // author is stored in the same field as owner
    public String getAuthor() {
        return owner;
    }

    public void setAuthor(String author) {
        this.owner = author;
    }

    /*public String getLastModified() {
        return lastModified;
    }*/

    /*public void setLastModified(String lastModified) {
        this.lastModified = lastModified;
    }*/

    public String getOwner() {
        return owner;
    }

    public void setOwner(String owner) {
        this.owner = owner;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public String getPath() {
        return path;
    }

    public void setPath(String path) {
        this.path = path;
    }
}
package nk.stackoverflow.spark;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class CustomInputFormat {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName(args[0]);
        conf.setMaster("local[*]");
        final String inputPath = args[1];

        JavaSparkContext sc = new JavaSparkContext(conf);

        // Key = file path, value = FileInfoWritable (content plus file metadata)
        JavaPairRDD<Text, FileInfoWritable> rdd = sc.newAPIHadoopFile(inputPath, RichFileInputFormat.class,
                Text.class, FileInfoWritable.class, new Configuration());

        rdd.foreach(new VoidFunction<Tuple2<Text, FileInfoWritable>>() {
            public void call(Tuple2<Text, FileInfoWritable> t) throws Exception {
                final Text filePath = t._1();
                final String fileContent = t._2().getContent();
                System.out.println("file " + filePath + " has contents= " + fileContent);
            }
        });

        sc.close();
    }
}
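If you are mostly after the metadata rather than the raw content, one option (a minimal sketch, not part of the original answer) is to map the Hadoop writables into plain Strings inside the same main method, before sc.close(). Text and FileInfoWritable do not implement java.io.Serializable, so converting them early avoids surprises with operations such as collect(). The variable name metadata is illustrative, and an extra import of org.apache.spark.api.java.function.PairFunction is assumed:
// Turn (Text, FileInfoWritable) into plain, serializable Strings holding just the metadata
JavaPairRDD<String, String> metadata = rdd.mapToPair(
        new PairFunction<Tuple2<Text, FileInfoWritable>, String, String>() {
            public Tuple2<String, String> call(Tuple2<Text, FileInfoWritable> t) throws Exception {
                FileInfoWritable info = t._2();
                // createdDate holds the modification time in milliseconds as a String (see RichFileRecordReader)
                return new Tuple2<String, String>(t._1().toString(),
                        info.getOwner() + "," + info.getCreatedDate());
            }
        });

for (Tuple2<String, String> entry : metadata.collect()) {
    System.out.println(entry._1() + " -> " + entry._2());
}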
Answer 1 (score: 0)
Process this RDD with a map transformation. Inside the map function, call a helper that takes the String (i.e. the file name) and uses it to open and inspect the file. So it is nothing more than a map over the RDD that invokes a function for each element, as in the sketch below.
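A minimal sketch of that idea, run against the miao pair RDD from the question (the variable name fileDetails is illustrative; it assumes additional imports for JavaRDD, Function, Tuple2 and the Hadoop Configuration, FileSystem, FileStatus and Path classes, and that the paths stored in the keys are reachable from the executors):
JavaRDD<String> fileDetails = miao.map(
        new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> t) throws Exception {
                // The key produced by wholeTextFiles/binaryFiles is the file path (e.g. hdfs://... or file:/...)
                Path p = new Path(t._1());
                FileSystem fs = p.getFileSystem(new Configuration());
                FileStatus stat = fs.getFileStatus(p);
                // Only the metadata is looked up here; the file content is already available in t._2()
                return t._1() + " owner=" + stat.getOwner()
                        + " modified=" + stat.getModificationTime();
            }
        });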