Returning conditional results from a Spark RDD of Tuple2

Date: 2019-03-02 04:27:11

Tags: java scala apache-spark

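I'm new to Spark and am using Java. I have a JavaRDD<Tuple2<String, String>> and want to run an operation on it that treats Tuple2._1 as the key and Tuple2._2 as the value. Among all tuples that share a key, if a Tuple2._2 string matches any other Tuple2._2 string by more than 50%, only one Tuple2 should be returned; otherwise all of them should be returned.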

tuple.reduceByKey(new PairFunction<Tuple2<String, String>, String, String>() {
        public Tuple2<String, String> call(Tuple2<String, String> item1, Tuple2<String, String> item2) {
            List<String> category1 = Arrays.asList(item1._2.split("\t")[0].split(","));
            List<String> name1 = Arrays.asList(item1._2.split("\t")[1].split(","));
            List<String> category2 = Arrays.asList(item2._2.split("\t")[0].split(","));
            List<String> name2 = Arrays.asList(item2._2.split("\t")[1].split(","));

            int counter1=0; int counter2=0;
            for(String word: category1) {
                if(category2.contains(word))
                    counter1++;
            }
            for(String word: name1) {
                if(name2.contains(word))
                    counter2++;
            }
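            if(counter1 >= 0.50*category1.size() && counter2 >= 0.50*name1.size()) {
                // at least 50% overlap: only one of the two tuples should be returned here
            }
            else {
                // otherwise: both tuples should be returned
            }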
        }
    });

Or could I use .filter() here to return only one String / Tuple2 out of all the matches?
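
One possible reading of the requirement, sketched in plain Java so it can be tried without a cluster: within the values of a single key, keep a value only if it does not match a value that was already kept. The class name OverlapDedup and the methods matches and dedupe are hypothetical names, not part of Spark. The value layout (comma-separated categories, a tab, then comma-separated names) and the >= 0.50 thresholds are taken from the snippet above. reduceByKey must combine two values into a single value of the same type, and filter sees one element at a time, so neither maps naturally onto "return one or return all". The sketch therefore assumes that all values of a key are available together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the names are illustrative and not part of Spark.
class OverlapDedup {

    // Same test as in the question: at least half of the first value's
    // categories AND at least half of its names also appear in the second value.
    // Assumes each value looks like "cat1,cat2<TAB>name1,name2".
    static boolean matches(String value1, String value2) {
        String[] parts1 = value1.split("\t");
        String[] parts2 = value2.split("\t");
        List<String> category1 = Arrays.asList(parts1[0].split(","));
        List<String> name1 = Arrays.asList(parts1[1].split(","));
        List<String> category2 = Arrays.asList(parts2[0].split(","));
        List<String> name2 = Arrays.asList(parts2[1].split(","));

        int counter1 = 0;
        for (String word : category1) {
            if (category2.contains(word)) {
                counter1++;
            }
        }
        int counter2 = 0;
        for (String word : name1) {
            if (name2.contains(word)) {
                counter2++;
            }
        }
        return counter1 >= 0.50 * category1.size()
                && counter2 >= 0.50 * name1.size();
    }

    // Greedy pass over the values of ONE key: a value is dropped when it
    // matches a value that has already been kept, otherwise it is kept.
    static List<String> dedupe(Iterable<String> values) {
        List<String> kept = new ArrayList<>();
        for (String candidate : values) {
            boolean duplicate = false;
            for (String existing : kept) {
                if (matches(candidate, existing)) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

With Spark, this might be wired in by converting the RDD with JavaPairRDD.fromJavaRDD(tuple), calling groupByKey(), and passing each key's Iterable<String> to dedupe before flattening the result back into tuples. Because the test is asymmetric and the pass is greedy, the outcome can depend on the order in which a key's values arrive, which groupByKey does not guarantee.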

0 answers:

No answers yet