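I am new to Spark and am using Java. On a JavaRDD<Tuple2<String, String>>, I want to perform an operation in which Tuple2._1 is the key and Tuple2._2 is the value. For all tuples with the same key, if one Tuple2._2 string matches any other Tuple2._2 string by more than 50%, only one Tuple2 should be returned; otherwise all of them should be returned.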
tuple.reduceByKey(new PairFunction<Tuple2<String, String>, String, String>() {
    public Tuple2<String, String> call(Tuple2<String, String> item1, Tuple2<String, String> item2) {
        // Each value has the form "categories<TAB>names", both parts comma-separated.
        List<String> category1 = Arrays.asList(item1._2.split("\t")[0].split(","));
        List<String> name1 = Arrays.asList(item1._2.split("\t")[1].split(","));
        List<String> category2 = Arrays.asList(item2._2.split("\t")[0].split(","));
        List<String> name2 = Arrays.asList(item2._2.split("\t")[1].split(","));
        int counter1 = 0;
        int counter2 = 0;
        // Count the categories of item1 that also appear in item2.
        for (String word : category1) {
            if (category2.contains(word))
                counter1++;
        }
        // Count the names of item1 that also appear in item2.
        for (String word : name1) {
            if (name2.contains(word))
                counter2++;
        }
        if (counter1 >= 0.50 * category1.size() && counter2 >= 0.50 * name1.size()) {
            // Match: only one Tuple2 should be returned here.
        } else {
            // No match: all of the Tuple2s should be returned here.
        }
    }
});
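For reference, the matching rule in the attempt above can be pulled out into a plain Java helper that is independent of Spark. This is only a sketch under stated assumptions: the class name OverlapMatch and the method names are hypothetical, every value is assumed to contain exactly one tab separating the category list from the name list, and the threshold is "at least 50%" (as the >= in the code has it) rather than strictly "more than 50%". The test is also asymmetric, because the 50% is measured against the first value's lists.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Spark: isolates the 50% overlap rule.
final class OverlapMatch {

    // True when at least half of the tokens in `left` also occur in `right`.
    static boolean atLeastHalfShared(List<String> left, List<String> right) {
        Set<String> lookup = new HashSet<>(right);
        int shared = 0;
        for (String token : left) {
            if (lookup.contains(token)) {
                shared++;
            }
        }
        return shared >= 0.50 * left.size();
    }

    // Assumes each value is "categories<TAB>names", each part comma-separated.
    static boolean matches(String value1, String value2) {
        String[] parts1 = value1.split("\t");
        String[] parts2 = value2.split("\t");
        List<String> category1 = Arrays.asList(parts1[0].split(","));
        List<String> name1 = Arrays.asList(parts1[1].split(","));
        List<String> category2 = Arrays.asList(parts2[0].split(","));
        List<String> name2 = Arrays.asList(parts2[1].split(","));
        // Both the category lists and the name lists must overlap enough.
        return atLeastHalfShared(category1, category2)
                && atLeastHalfShared(name1, name2);
    }
}
```

Keeping the rule in a small function like this makes it possible to check it with ordinary unit tests before wiring it into any RDD operation.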
Or can I use .filter() here to return only one String / Tuple2 from all the matches?
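Some context on the two operations mentioned: filter() evaluates each element on its own, so its predicate cannot see the other values that share a key, and reduceByKey() always produces exactly one value per key, so it cannot return "all of them". One possible route, offered here only as a sketch, is to gather the values of each key first (for example with groupByKey() on a JavaPairRDD<String, String>) and then apply a per-group function such as the one below. The class name KeyGroupCollapse, the method collapse, and the choice to keep the first value that matches another one are all assumptions made for illustration; the matcher is passed in as a parameter so that any pairwise rule can be used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper, not part of Spark: decides what one key's group returns.
final class KeyGroupCollapse {

    // Returns a single value if any value matches another value in the group,
    // otherwise returns a copy of all values.
    static List<String> collapse(List<String> values,
                                 BiPredicate<String, String> matcher) {
        for (int i = 0; i < values.size(); i++) {
            for (int j = 0; j < values.size(); j++) {
                // Both orders are tried because the matcher may be asymmetric.
                if (i != j && matcher.test(values.get(i), values.get(j))) {
                    // Assumption: the first value found to match is the one kept.
                    return Collections.singletonList(values.get(i));
                }
            }
        }
        return new ArrayList<>(values);
    }
}
```

The pairwise comparison is quadratic in the number of values per key, and groupByKey() holds all values of a key together, so this approach is likely only reasonable when each key has a modest number of values.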