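Filter a JavaRDD into three RDDs?

Time: 2016-10-08 12:20:42

Tags: apache-spark rdd

I want to split a JavaRDD into three different RDDs based on specific conditions. Right now I read the same RDD three times and filter it each time. Is there a more efficient way to do this in a single scan?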

Example:

Say I have an RDD of strings and I want to filter it by the names 'anshu', 'suman' and 'neeraj':

// rdd is a JavaRDD<String>; each filter below triggers its own pass over rdd
JavaRDD<String> rdd1 = rdd.filter(s -> s.contains("anshu"));
JavaRDD<String> rdd2 = rdd.filter(s -> s.contains("suman"));
JavaRDD<String> rdd3 = rdd.filter(s -> s.contains("neeraj"));

Instead of filtering the same RDD three times, can I do it with a single filter?

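1 answer:

Answer 0 (score: 0)

You can look at the following example. Here I use a map step in which each of the three conditions becomes a key, and then a reduce step to combine the values associated with each key.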

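import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// The result stays keyed by name (no .values() at the end), so that later
// steps can still select the entries belonging to one name.
JavaPairRDD<String, List<String>> rdd = javaSparkContext.textFile("/tmp/mathsetdata.dat").filter(new Function<String, Boolean>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Boolean call(String v1) throws Exception {
                // Keep only lines whose first field is one of the three names
                String[] split = v1.split(" ");
                return split[0].equals("suman") || split[0].equals("anshu") || split[0].equals("neeraj");
            }
        }).mapToPair(new PairFunction<String, String, List<String>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, List<String>> call(String t) throws Exception {
                // Key: the name; value: a one-element list holding the second field
                String[] split = t.split(" ");
                List<String> list = new ArrayList<String>();
                list.add(split[1].trim());
                return new Tuple2<String, List<String>>(split[0].trim(), list);
            }
        }).reduceByKey(new Function2<List<String>, List<String>, List<String>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public List<String> call(List<String> v1, List<String> v2) throws Exception {
                // Concatenate the value lists that share a key
                List<String> list = new ArrayList<String>();
                list.addAll(v1);
                list.addAll(v2);
                return list;
            }
        });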
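The key-selection step can also be pulled out into a small pure function, which makes it easy to check outside Spark. The following is only a sketch: `NameKeys` and `keyFor` are hypothetical names that do not appear in the answer, and it assumes, like the answer's code, that the name is the first whitespace-separated field of each line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper (not part of the original answer): maps a line to its key.
class NameKeys {
    // The three names from the question
    static final List<String> NAMES = Arrays.asList("anshu", "suman", "neeraj");

    // Returns the first field of the line if it is one of NAMES, otherwise empty.
    static Optional<String> keyFor(String line) {
        String first = line.trim().split("\\s+")[0];
        return NAMES.contains(first) ? Optional.of(first) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(keyFor("suman 1001")); // prints Optional[suman]
        System.out.println(keyFor("rahul 1004")); // prints Optional.empty
    }
}
```

With Java 8 lambdas, such a helper could presumably replace the anonymous filter class, for example `filter(s -> NameKeys.keyFor(s).isPresent())`, with the same function supplying the key inside `mapToPair`.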

Sample file:

suman 1001
anshu 1002
neeraj 1003
suman 1006
anshu 1007
neeraj 1008
suman 1016
anshu 1027
neeraj 1018
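To make the expected result for this sample file concrete, here is a Spark-free sketch that performs the same grouping with plain Java collections. It is a stand-in for the `mapToPair` plus `reduceByKey` steps, not the Spark code itself; `GroupSample` and `groupByName` are hypothetical names, and skipping malformed lines is an added assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Spark-free illustration of grouping the values by name in one pass.
class GroupSample {
    static Map<String, List<String>> groupByName(List<String> lines) {
        // LinkedHashMap keeps keys in first-seen order, which makes the output predictable
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String[] split = line.trim().split("\\s+");
            if (split.length < 2) {
                continue; // assumption: lines without a value field are skipped
            }
            grouped.computeIfAbsent(split[0], k -> new ArrayList<>()).add(split[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "suman 1001", "anshu 1002", "neeraj 1003",
                "suman 1006", "anshu 1007", "neeraj 1008",
                "suman 1016", "anshu 1027", "neeraj 1018");
        // prints {suman=[1001, 1006, 1016], anshu=[1002, 1007, 1027], neeraj=[1003, 1008, 1018]}
        System.out.println(groupByName(sample));
    }
}
```

In Spark itself, neither the order of the keys nor the order of the values within a key is guaranteed after `reduceByKey`, so only the contents of each group should be relied on.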

Further operations are also possible, for example:

// rdd must be the keyed JavaPairRDD<String, List<String>> produced by reduceByKey.
// get(0) throws IndexOutOfBoundsException if no "suman" entry exists.
Tuple2<String, Integer> sumanTotal = rdd.filter(new Function<Tuple2<String, List<String>>, Boolean>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Boolean call(Tuple2<String, List<String>> v1) throws Exception {
                // Keep only the entry whose key is "suman"
                return v1._1.equals("suman");
            }
        }).map(new Function<Tuple2<String, List<String>>, Tuple2<String, Integer>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(Tuple2<String, List<String>> v1) throws Exception {
                // Parse each collected value and add them up
                int sum = 0;
                for (String str : v1._2) {
                    sum += Integer.parseInt(str);
                }
                return new Tuple2<String, Integer>(v1._1, sum);
            }
        }).collect().get(0);
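As a quick check of what this step yields for the sample file, the summation can be run on its own. This is a sketch only: `SumValues` is a hypothetical class name, and the input list is simply the three "suman" values from the sample file.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone version of the summing logic inside the map function.
class SumValues {
    static int sum(List<String> values) {
        int total = 0;
        for (String v : values) {
            total += Integer.parseInt(v.trim());
        }
        return total;
    }

    public static void main(String[] args) {
        // The "suman" values from the sample file: 1001 + 1006 + 1016
        System.out.println(sum(Arrays.asList("1001", "1006", "1016"))); // prints 3023
    }
}
```

So for the sample file, the resulting tuple would be expected to hold the key "suman" and the value 3023.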