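I am trying to load a set of files, run some checks on them, and then save them in HDFS. However, I have not found a good way to create and save these sequence files. Here is the main function of my loader: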
SparkConf sparkConf = new SparkConf().setAppName("writingHDFS")
        .setMaster("local[2]")
        .set("spark.streaming.stopGracefullyOnShutdown", "true");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
//JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5*1000));
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles("file:///home/cloudera/Pictures/cat");
JavaPairRDD<String, String> imageRDD = jsc.wholeTextFiles("file:///home/cloudera/Pictures/");
imageRDD.mapToPair(new PairFunction<Tuple2<String, String>, Text, Text>() {
    @Override
    public Tuple2<Text, Text> call(Tuple2<String, String> arg0) throws Exception {
        return new Tuple2<Text, Text>(new Text(arg0._1), new Text(arg0._2));
    }
}).saveAsNewAPIHadoopFile("hdfs://localhost:8020/user/hdfs/sparkling/try.seq", Text.class, Text.class, SequenceFileOutputFormat.class);
It simply loads some images as text files, uses the file name as the key of the PairRDD, and saves them with the built-in saveAsNewAPIHadoopFile.
I would now like to save file by file inside rdd.foreach or rdd.foreachPartition, but I cannot find a proper method:
OutputStream out = fs.create(new Path(dst));
This created a directory, and even when I get no exception the directory does not show up. Mkdirs didn't work either.
Edit: I may have found a way, but I get a Task not serializable exception:
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles("file:///home/cloudera/Pictures/cat");
imageByteRDD.foreach(new VoidFunction<Tuple2<String, PortableDataStream>>() {
    @Override
    public void call(Tuple2<String, PortableDataStream> fileTuple) throws Exception {
        Text key = new Text(fileTuple._1());
        BytesWritable value = new BytesWritable(fileTuple._2().toArray());
        SequenceFile.Writer writer = SequenceFile.createWriter(
                serializableConfiguration.getConf(),
                SequenceFile.Writer.file(new Path("/user/hdfs/sparkling/" + key)),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        key = new Text("MiaoMiao!");
        writer.append(key, value);
        IOUtils.closeStream(writer);
    }
});
I have tried wrapping the whole function in a Serializable class, but with no luck. Any help?
Answer 0: (score: 0)
The way I do it is as follows (pseudocode; I will try to edit this answer as soon as I get to the office):
rdd.foreachPartition{
Configuration conf = ConfigurationSingletonClass.getConfiguration();
etcetera, etcetera...
}
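A likely reason for obtaining the Configuration inside the partition function, as in the pseudocode above, is that Spark serializes the function object with Java serialization before sending it to the executors, and Hadoop's Configuration does not implement java.io.Serializable. The sketch below reproduces that effect with plain JDK classes only. FakeConf, Task, CapturesConf and BuildsConfInside are hypothetical stand-ins invented for this illustration; they are not Spark or Hadoop types.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ClosureSerializationDemo {

    /** Hypothetical stand-in for a non-serializable resource such as a Hadoop Configuration. */
    static class FakeConf {
        final String defaultFs;
        FakeConf(String defaultFs) { this.defaultFs = defaultFs; }
    }

    /** Hypothetical stand-in for a Spark function interface: the object must be serializable. */
    interface Task extends Serializable {
        String call(String input);
    }

    /** Keeps the resource as a field, so serializing the task also tries to serialize the resource. */
    static class CapturesConf implements Task {
        private final FakeConf conf;
        CapturesConf(FakeConf conf) { this.conf = conf; }
        @Override public String call(String input) { return conf.defaultFs + "/" + input; }
    }

    /** Carries only a String and builds the resource inside call(), i.e. where the task runs. */
    static class BuildsConfInside implements Task {
        private final String defaultFs;
        BuildsConfInside(String defaultFs) { this.defaultFs = defaultFs; }
        @Override public String call(String input) {
            FakeConf conf = new FakeConf(defaultFs);
            return conf.defaultFs + "/" + input;
        }
    }

    /** Returns true if Java serialization accepts the task, false on NotSerializableException. */
    static boolean isSerializable(Task task) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(task);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        FakeConf conf = new FakeConf("hdfs://localhost:8020");
        System.out.println(isSerializable(new CapturesConf(conf)));
        System.out.println(isSerializable(new BuildsConfInside("hdfs://localhost:8020")));
    }
}
```

The first call reports false because the captured FakeConf field cannot be serialized; the second reports true because only a String travels with the task. The same reasoning applies to an anonymous inner class, which also holds a reference to its enclosing instance.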
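Edit: I got to the office, and here is the full snippet. The Configuration is created inside rdd.foreachPartition (doing it in foreach would be a bit too much). Inside the iterator loop, each file is written out in sequence file format.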
// SOURCE_PATH, HDFS_PATH, DEST_PATH, SEP_PATH, SEP_KEY, DOT, DOT_REGEX and
// getCurrentTimeStamp() are not shown in this snippet.
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(SOURCE_PATH);
if (!imageByteRDD.isEmpty()) {
    imageByteRDD.foreachPartition(new VoidFunction<Iterator<Tuple2<String, PortableDataStream>>>() {
        @Override
        public void call(Iterator<Tuple2<String, PortableDataStream>> arg0) throws Exception {
            // Create the Configuration here, on the executor, instead of capturing one from the driver.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", HDFS_PATH);
            while (arg0.hasNext()) {
                Tuple2<String, PortableDataStream> fileTuple = arg0.next();
                // Split the input path into parent directory, file name and extension.
                String[] pathParts = fileTuple._1().split(SEP_PATH);
                String[] nameParts = pathParts[pathParts.length - 1].split(DOT_REGEX);
                String fileName = nameParts[0];
                String fileExtension = nameParts[nameParts.length - 1];
                BytesWritable value = new BytesWritable(fileTuple._2().toArray());
                SequenceFile.Writer writer = SequenceFile.createWriter(
                        conf,
                        SequenceFile.Writer.file(new Path(DEST_PATH + fileName + SEP_KEY + getCurrentTimeStamp() + DOT + fileExtension)),
                        SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(BytesWritable.class));
                try {
                    // Key: <parent directory> SEP_KEY <file name> SEP_KEY <extension>
                    Text key = new Text(pathParts[pathParts.length - 2] + SEP_KEY + fileName + SEP_KEY + fileExtension);
                    writer.append(key, value);
                } finally {
                    // Close the writer even if append() fails.
                    IOUtils.closeStream(writer);
                }
            }
        }
    });
}
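The path handling in the snippet above (parent directory, file name, extension) can be isolated into a small helper that is easy to check without a cluster. This is only a sketch under assumptions: the class SequenceFileNames and its methods are hypothetical names, "/" is assumed as the path separator, and the key separator and timestamp are passed in rather than taken from the constants used above. Unlike the snippet, it splits on the last dot, so a name such as a.b.jpg keeps a.b as its base name.

```java
class SequenceFileNames {

    /** Parent directory name, base name and extension of one input path. */
    static final class Parts {
        final String parentDir;
        final String baseName;
        final String extension;

        Parts(String parentDir, String baseName, String extension) {
            this.parentDir = parentDir;
            this.baseName = baseName;
            this.extension = extension;
        }
    }

    /** Splits a path or URI string such as "file:/dir/sub/name.ext" into its parts. */
    static Parts parse(String path) {
        String[] segments = path.split("/");
        String last = segments[segments.length - 1];
        String parentDir = segments.length >= 2 ? segments[segments.length - 2] : "";
        // Split on the last dot; a leading dot (hidden file) is not treated as an extension.
        int dot = last.lastIndexOf('.');
        String baseName = dot > 0 ? last.substring(0, dot) : last;
        String extension = dot > 0 ? last.substring(dot + 1) : "";
        return new Parts(parentDir, baseName, extension);
    }

    /** Record key in the form parentDir + sep + baseName + sep + extension. */
    static String recordKey(Parts p, String sep) {
        return p.parentDir + sep + p.baseName + sep + p.extension;
    }

    /** Output file name in the form baseName + sep + timestamp [+ "." + extension]. */
    static String outputName(Parts p, String sep, String timestamp) {
        String name = p.baseName + sep + timestamp;
        return p.extension.isEmpty() ? name : name + "." + p.extension;
    }

    public static void main(String[] args) {
        Parts p = parse("file:/home/cloudera/Pictures/cat/cat1.jpg");
        System.out.println(recordKey(p, "_"));
        System.out.println(outputName(p, "_", "20240101"));
    }
}
```

Keeping this logic in plain static methods means it can be called from inside foreachPartition without adding anything to what the function object has to serialize.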
Hope this helps.