How to filter records from a JavaPairRDD

Date: 2016-02-09 16:14:11

Tags: java apache-spark

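I am working through a simple WordCount example in Apache Spark. Now that I finally have the counts, I want to filter them down to only the unique words.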

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class SparkClass {
    public static void main(String[] args) {

        String file = "/home/bhaumik/Documents/my";
        JavaSparkContext sc = new JavaSparkContext("local", "SimpleApp");

        // Read the text file and split every line into words.
        JavaRDD<String> lines = sc.textFile("/home/bhaumik/Documents/myText", 5)
                .flatMap(new FlatMapFunction<String, String>() {
                    @Override
                    public Iterable<String> call(String t) throws Exception {
                        // Spark 1.x signature: call returns an Iterable.
                        return Arrays.asList(t.split(" "));
                    }
                });

        // Pair every word with an initial count of 1.
        JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String t) throws Exception {
                return new Tuple2<String, Integer>(t, 1);
            }
        });

        // Sum the counts per word.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
    }
}

2 answers:

Answer 0 (score: 2)

In counts you have an RDD of keys and their numbers of occurrences. You cannot read the minimum off it directly, so you should reduce:

Tuple2<String, Integer> minApp = counts.reduce((a, b) -> (a._2 > b._2)? b : a);
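
To make the comparison inside that reduce easier to follow, here is a minimal sketch of the same logic in plain Java collections, without Spark. The class name MinCountSketch and the helpers countWords and minByCount are hypothetical names chosen for illustration; they are not part of the Spark API. Note that a reduce of this kind yields a single pair, so if several words share the lowest count, only one of them is returned.

```java
import java.util.Map;
import java.util.TreeMap;

class MinCountSketch {

    // Hypothetical helper: count words split on single spaces,
    // mirroring the mapToPair + reduceByKey steps of the question.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.split(" ")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Same comparison as the lambda in the answer:
    // keep whichever of the two entries has the smaller count.
    static Map.Entry<String, Integer> minByCount(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .reduce((a, b) -> (a.getValue() > b.getValue()) ? b : a)
                .orElseThrow(() -> new IllegalArgumentException("counts is empty"));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("a b a c a b");
        Map.Entry<String, Integer> min = minByCount(counts);
        System.out.println(min.getKey() + " occurs " + min.getValue() + " time(s)");
    }
}
```

This returns the least frequent word rather than every word that occurs once, so it only answers the question when exactly one word has the lowest count.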

Answer 1 (score: 1)

JavaPairRDD<String, Integer> uniqueIP = counts.filter(new Function<Tuple2<String, Integer>, Boolean>() {
    @Override
    public Boolean call(Tuple2<String, Integer> v1) throws Exception {
        // Keep only the words that occur exactly once.
        return v1._2.equals(1);
    }
});

This is how I solved the problem.
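
The filter above keeps every pair whose count equals 1. As a rough illustration of that predicate outside Spark, the sketch below applies the same test to a plain Java map. The names UniqueWordsSketch, countWords, and uniqueWords are hypothetical and exist only for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class UniqueWordsSketch {

    // Hypothetical helper: count words split on single spaces.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.split(" ")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Same predicate as the Spark filter: keep entries whose count is exactly 1,
    // then return just the words, in the map's iteration order.
    static List<String> uniqueWords(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue() == 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords(countWords("a b a c d")));
    }
}
```

Unlike a reduce, a filter returns all matching entries, which is why it fits the goal of listing every word that appears only once.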