I'm doing a simple WordCount example in Apache Spark. I now have the counts, and I want to filter out only the unique words (the ones that occur exactly once).
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class SparkClass {
    public static void main(String[] args) {
        String file = "/home/bhaumik/Documents/my";
        JavaSparkContext sc = new JavaSparkContext("local", "SimpleApp");

        // Split each line of the input file into words
        JavaRDD<String> lines = sc.textFile("/home/bhaumik/Documents/myText", 5)
                .flatMap(new FlatMapFunction<String, String>() {
                    @Override
                    public Iterable<String> call(String t) throws Exception {
                        return Arrays.asList(t.split(" "));
                    }
                });

        // Map each word to a (word, 1) pair
        JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String t) throws Exception {
                return new Tuple2<String, Integer>(t, 1);
            }
        });

        // Sum the counts per word
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
    }
}
Answer 0 (score: 2)
In `counts` you have an RDD with each key and its number of occurrences. You can't get the minimum from it directly, so you should reduce:
Tuple2<String, Integer> minApp = counts.reduce((a, b) -> (a._2 > b._2)? b : a);
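Since the Spark snippet above needs a running SparkContext, here is the same min-by-count logic sketched with plain Java streams; the `MinCount` class and `minByCount` method names are my own, introduced only for illustration:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class MinCount {
    // Counts each word's occurrences, then picks the (word, count) entry with
    // the smallest count -- mirroring counts.reduce((a, b) -> (a._2 > b._2) ? b : a).
    static Map.Entry<String, Long> minByCount(String text) {
        Map<String, Long> counts = Arrays.stream(text.split(" "))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .get();
    }

    public static void main(String[] args) {
        // "c" appears once, fewer times than "a" (3) or "b" (2)
        Map.Entry<String, Long> min = minByCount("a b a c a b");
        System.out.println(min.getKey() + "=" + min.getValue()); // prints c=1
    }
}
```

On ties the stream `min` keeps an arbitrary entry, just as the order of a Spark `reduce` is unspecified.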
Answer 1 (score: 1)
JavaPairRDD<String, Integer> uniqueIP = counts.filter(new Function<Tuple2<String, Integer>, Boolean>() {
    @Override
    public Boolean call(Tuple2<String, Integer> v1) throws Exception {
        // Keep only words that occur exactly once
        return v1._2.equals(1);
    }
});
That's how I solved the problem.
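For a quick check without a Spark cluster, the same "keep words with count == 1" selection can be sketched with plain Java streams; the `UniqueWords` class and `uniqueWords` method names are hypothetical, chosen just for this example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class UniqueWords {
    // Counts each word, then keeps only the words that occur exactly once --
    // the plain-Java equivalent of counts.filter(pair -> pair._2 == 1).
    static List<String> uniqueWords(String text) {
        Map<String, Long> counts = Arrays.stream(text.split(" "))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                .filter(e -> e.getValue() == 1)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "c" and "d" each appear once; "a" and "b" repeat
        System.out.println(uniqueWords("a b a c a b d")); // prints [c, d]
    }
}
```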