Java: Reading and Writing Spark Vectors to HDFS

Date: 2016-07-19 15:01:11

Tags: java hadoop apache-spark hdfs

I write Vectors (org.apache.spark.mllib.linalg.Vector) to HDFS as follows:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.spark.mllib.linalg.Vector;

public void writePointsToFile(Path path, FileSystem fs, Configuration conf,
        List<Vector> points) throws IOException {

    // Note: Vector does not implement Writable, so SequenceFile will look up
    // a serializer for the value class; without one configured (e.g.
    // JavaSerialization in "io.serializations"), append may fail at runtime.
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            Writer.file(path), Writer.keyClass(LongWritable.class),
            Writer.valueClass(Vector.class));

    // Key each vector by its record index.
    long recNum = 0;
    for (Vector point : points) {
        writer.append(new LongWritable(recNum++), point);
    }
    writer.close();
}

(I am not sure whether this is the right approach, and I have not been able to test it.)
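
For reference, here is a minimal sketch of how the method above might be invoked. The /tmp/points.seq path and the sample vectors are illustrative assumptions, and Vectors is org.apache.spark.mllib.linalg.Vectors:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Hypothetical sample data; Vectors.dense builds a dense mllib vector.
List<Vector> points = Arrays.asList(
        Vectors.dense(1.0, 2.0, 3.0),
        Vectors.dense(4.0, 5.0, 6.0));

writePointsToFile(new Path("/tmp/points.seq"), fs, conf, points);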

Now I need to read this file back as a JavaRDD&lt;Vector&gt;, because I want to use it with Spark's k-means clustering, but I do not know how to do that.

1 Answer:

Answer 0 (score: 0)

Spark supports reading Hadoop SequenceFiles directly. You can do something like this:

// Read the SequenceFile back as (key, value) pairs.
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<LongWritable, Vector> input =
    sc.sequenceFile(fileName, LongWritable.class, Vector.class);

Then you just need to transform the JavaPairRDD&lt;LongWritable, Vector&gt; into a JavaRDD&lt;Vector&gt;:

JavaRDD<Vector> out = input.map(new Function<Tuple2<LongWritable, Vector>, Vector>() {

    @Override
    public Vector call(Tuple2<LongWritable, Vector> tuple) throws Exception {
        // Discard the LongWritable key and keep only the vector.
        return tuple._2();
    }
});
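
(With Java 8 the same transformation can be written as input.map(tuple -> tuple._2()), or simply input.values(), which JavaPairRDD provides for exactly this purpose.)

Since the goal is k-means clustering, the resulting JavaRDD&lt;Vector&gt; can then be fed to MLlib's KMeans. A rough sketch, where k = 2 and maxIterations = 20 are placeholder values, not taken from the question:

// Cache the data; k-means makes multiple passes over it.
out.cache();

// Train a k-means model (org.apache.spark.mllib.clustering.KMeans).
int k = 2;
int maxIterations = 20;
KMeansModel model = KMeans.train(out.rdd(), k, maxIterations);

// Assign each vector to its nearest cluster center.
JavaRDD<Integer> clusterIds = model.predict(out);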