I am trying to sort an RDD by value, and when several values are equal I need those entries ordered lexicographically by key.
Code:
JavaPairRDD<String, Long> rddToSort = rddMovieReviewReducedByKey.mapToPair(
    new PairFunction<Tuple2<String, MovieReview>, String, Long>() {
        @Override
        public Tuple2<String, Long> call(Tuple2<String, MovieReview> t) throws Exception {
            return new Tuple2<String, Long>(t._1, t._2.count);
        }
    });
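What I have done so far is use takeOrdered with a custom comparator, but takeOrdered cannot cope with this much data, so the job fails when I run it (it consumes more memory than the operating system can provide):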
List<Tuple2<String, Long>> rddSorted = rddMovieReviewReducedByKey.mapToPair(
    new PairFunction<Tuple2<String, MovieReview>, String, Long>() {
        @Override
        public Tuple2<String, Long> call(Tuple2<String, MovieReview> t) throws Exception {
            return new Tuple2<String, Long>(t._1, t._2.count);
        }
    }).takeOrdered(newTopMovies, MapLongValueComparator.VALUE_COMP);
Comparator:
static class MapLongValueComparator implements Comparator<Tuple2<String, Long>>, Serializable {
    private static final long serialVersionUID = 1L;
    private static final MapLongValueComparator VALUE_COMP = new MapLongValueComparator();

    @Override
    public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
        // Equal counts: fall back to lexicographic order of the key.
        if (o1._2.compareTo(o2._2) == 0) {
            return o1._1.compareTo(o2._1);
        }
        // Otherwise order by count, largest first.
        return -o1._2.compareTo(o2._2);
    }
}
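To make the intended ordering concrete, here is a minimal plain-Java sketch of the same comparison logic with no Spark involved. It uses Map.Entry in place of Tuple2, and the class name ValueThenKeyOrdering, the helper entry and the sample titles are all made up for illustration:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Comparator;

public class ValueThenKeyOrdering {
    // Same rule as MapLongValueComparator: larger count first,
    // ties broken by ascending (lexicographic) key.
    static final Comparator<Map.Entry<String, Long>> VALUE_DESC_THEN_KEY = (a, b) -> {
        int byValue = b.getValue().compareTo(a.getValue());
        return byValue != 0 ? byValue : a.getKey().compareTo(b.getKey());
    };

    // Hypothetical helper that builds one (title, count) pair.
    static Map.Entry<String, Long> entry(String title, long count) {
        return new SimpleEntry<String, Long>(title, count);
    }

    // Returns the keys in ranked order without modifying the input list.
    static List<String> sortedKeys(List<Map.Entry<String, Long>> entries) {
        List<Map.Entry<String, Long>> copy = new ArrayList<>(entries);
        copy.sort(VALUE_DESC_THEN_KEY);
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Long> e : copy) {
            keys.add(e.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> counts = new ArrayList<>();
        counts.add(entry("movieB", 5L));
        counts.add(entry("movieA", 5L));
        counts.add(entry("movieC", 9L));
        counts.add(entry("movieD", 1L));
        // prints [movieC, movieA, movieB, movieD]
        System.out.println(sortedKeys(counts));
    }
}
```

movieA and movieB share a count of 5, so they come out in key order, after movieC and before movieD.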
ERROR:
16/06/30 21:09:23 INFO scheduler.DAGScheduler: Job 18 failed: takeOrdered at MovieAnalyzer.java:708, took 418.149182 s
How would you sort this RDD? How would you take the TopKMovies, ordering by value and, when values are equal, lexicographically by key?
Thanks.
Answer 0 (score: 3)
I solved this by using sortByKey with a comparator and by replacing the <String, Long> PairRDD with a <Tuple2<String,Long>, Long> PairRDD:
// Fold the count into the key so that sortByKey can see both the title and the count.
JavaPairRDD<Tuple2<String, Long>, Long> sortedRdd = rddMovieReviewReducedByKey.mapToPair(
    new PairFunction<Tuple2<String, MovieReview>, Tuple2<String, Long>, Long>() {
        @Override
        public Tuple2<Tuple2<String, Long>, Long> call(Tuple2<String, MovieReview> t) throws Exception {
            return new Tuple2<Tuple2<String, Long>, Long>(
                new Tuple2<String, Long>(t._1, t._2.count), t._2.count);
        }
    }).sortByKey(new TupleMapLongComparator(), true, 100);

// Unwrap the composite key back into (title, count) pairs.
JavaPairRDD<String, Long> sortedRddToPairs = sortedRdd.mapToPair(
    new PairFunction<Tuple2<Tuple2<String, Long>, Long>, String, Long>() {
        @Override
        public Tuple2<String, Long> call(Tuple2<Tuple2<String, Long>, Long> t) throws Exception {
            return new Tuple2<String, Long>(t._1._1, t._1._2);
        }
    });
Comparator:
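// Declared static so the comparator holds no reference to the enclosing instance,
// which would otherwise have to be serialized along with it.
private static class TupleMapLongComparator implements Comparator<Tuple2<String, Long>>, Serializable {
    @Override
    public int compare(Tuple2<String, Long> tuple1, Tuple2<String, Long> tuple2) {
        // Equal counts: fall back to lexicographic order of the title.
        if (tuple1._2.compareTo(tuple2._2) == 0) {
            return tuple1._1.compareTo(tuple2._1);
        }
        // Otherwise order by count, largest first.
        return -tuple1._2.compareTo(tuple2._2);
    }
}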
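The idea of folding the value into the key can be tried outside Spark as well. The sketch below is only an analogy: a TreeMap stands in for sortByKey because it also orders entries purely by key. The names CompositeKeySketch, TitleCount and rank are hypothetical and do not come from the code above:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CompositeKeySketch {
    // Composite key carrying both the title and its count (hypothetical type).
    static final class TitleCount {
        final String title;
        final long count;

        TitleCount(String title, long count) {
            this.title = title;
            this.count = count;
        }
    }

    // Larger count first; equal counts ordered by title.
    static final Comparator<TitleCount> COUNT_DESC_THEN_TITLE = (a, b) -> {
        int byCount = Long.compare(b.count, a.count);
        return byCount != 0 ? byCount : a.title.compareTo(b.title);
    };

    // Wrap each (title, count) pair into a composite key, let the key-sorted
    // map order the entries, then unwrap them into "title=count" strings.
    static List<String> rank(Map<String, Long> counts) {
        TreeMap<TitleCount, Long> byCompositeKey = new TreeMap<>(COUNT_DESC_THEN_TITLE);
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            byCompositeKey.put(new TitleCount(e.getKey(), e.getValue()), e.getValue());
        }
        List<String> ranked = new ArrayList<>();
        for (TitleCount key : byCompositeKey.keySet()) {
            ranked.add(key.title + "=" + key.count);
        }
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("b", 3L);
        counts.put("a", 3L);
        counts.put("c", 7L);
        // prints [c=7, a=3, b=3]
        System.out.println(rank(counts));
    }
}
```

Because the titles are unique, the comparator never reports two different entries as equal, so nothing is lost when the composite keys are put into the map.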
Answer 1 (score: 1)
Have you tried secondary sort in Spark?