New to Apache Spark in Java. I have a comma-separated text file that looks like this:
3,45.25,23.45
5,22.15,19.35
4,33.24,12.45
2,15.67,21.22
Here the columns are: an index, a latitude, and a longitude.
I am trying to parse this data into 2 or 3 RDDs (or pair RDDs). Here is my code so far:
JavaRDD<String> data = sc.textFile("hdfs://data.txt");
JavaRDD<Double> data1 = data.flatMap(
    new FlatMapFunction<String, Double>() {
        public Iterable<Double> call(String line) {
            // Split the comma-separated line and parse each field as a Double
            List<Double> values = new ArrayList<Double>();
            for (String field : line.split(",")) {
                values.add(Double.parseDouble(field));
            }
            return values;
        }
    });
Answer (score: 1)
Something like this (using Java 8 for better readability)?
JavaRDD<String> data = sc.textFile("hdfs://data.txt");
// Requires: import scala.Tuple2; import scala.Tuple3;
// plus static imports of Integer.parseInt and Float.parseFloat
JavaRDD<Tuple3<Integer, Float, Float>> parsedData = data.map((line) -> line.split(","))
    .map((fields) -> new Tuple3<>(parseInt(fields[0]), parseFloat(fields[1]), parseFloat(fields[2])))
    .cache(); // Cache the parsed RDD to avoid recomputation in the subsequent .mapToPair calls
JavaPairRDD<Integer, Float> latitudeByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), line._2()));
JavaPairRDD<Integer, Float> longitudeByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), line._3()));
JavaPairRDD<Integer, Tuple2<Float, Float>> pointByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), new Tuple2<>(line._2(), line._3())));
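The per-line parsing logic in the chain above can be checked without a Spark cluster. Below is a minimal, hedged sketch of just that step: `ParseSketch` and its nested `Point` class are hypothetical names (standing in for `Tuple3<Integer, Float, Float>`), not part of the answer's code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ParseSketch {
    // Hypothetical holder mirroring Tuple3<Integer, Float, Float>
    static final class Point {
        final int index;
        final float latitude;
        final float longitude;
        Point(int index, float latitude, float longitude) {
            this.index = index;
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    // Same parsing as the .map(...) steps above, applied to one line
    static Point parseLine(String line) {
        String[] fields = line.split(",");
        return new Point(Integer.parseInt(fields[0]),
                         Float.parseFloat(fields[1]),
                         Float.parseFloat(fields[2]));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("3,45.25,23.45", "5,22.15,19.35");
        List<Point> points = lines.stream()
                                  .map(ParseSketch::parseLine)
                                  .collect(Collectors.toList());
        System.out.println(points.get(0).index + " " + points.get(0).latitude);
        // prints "3 45.25"
    }
}
```

Once each line is parsed this way, building the pair RDDs is just projecting fields, as the three `mapToPair` calls above show.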