I have a JavaPairRDD whose keys are of type Tuple2<Integer, Integer>.
I want to sort the JavaPairRDD by key, so I wrote a Comparator like this:
JavaPairRDD<Tuple2<Integer, Integer>, Integer> Rresult=result.sortByKey(new Comparator<Tuple2<Integer, Integer>>() {
@Override
public int compare(Tuple2<Integer, Integer> o1, Tuple2<Integer, Integer> o2) {
if(o1._1()==o2._1())
return o1._2()-o2._2();
return o1._1()-o2._1();
}
},true);
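It is meant to sort by the first entry of the tuple and, when the first entries are equal, by the second entry.
But I get the following error stack: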
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
.. scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1083)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at java.io.ObjectStrea
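Answer 0 (score: 2)
How did you create the JavaPairRDD? Please check that before applying the sort. If you pass a new anonymous Comparator directly to the sortByKey method, you will also get a "Task not serializable" exception. You should implement both Comparator and Serializable in a separate class and pass an instance of it to sortByKey. Here is a sample for your reference.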
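import java.io.Serializable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkSortSample {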
public static void main(String[] args) {
//SparkSession
SparkSession spark = SparkSession
.builder()
.appName("SparkSortSample")
.master("local[1]")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
//Sample data
List<Tuple2<Tuple2<Integer, Integer>, Integer>> inputList = new ArrayList<Tuple2<Tuple2<Integer, Integer>, Integer>>();
inputList.add(new Tuple2<Tuple2<Integer, Integer>, Integer>(new Tuple2<Integer, Integer>(2, 444), 4444));
inputList.add(new Tuple2<Tuple2<Integer, Integer>, Integer>(new Tuple2<Integer, Integer>(3, 333), 3333));
inputList.add(new Tuple2<Tuple2<Integer, Integer>, Integer>(new Tuple2<Integer, Integer>(1, 111), 1111));
inputList.add(new Tuple2<Tuple2<Integer, Integer>, Integer>(new Tuple2<Integer, Integer>(2, 222), 2222));
//JavaPairRDD
JavaPairRDD<Tuple2<Integer, Integer>, Integer> javaPairRdd = jsc.parallelizePairs(inputList);
//Sorted RDD
JavaPairRDD<Tuple2<Integer, Integer>, Integer> sortedPairRDD = javaPairRdd.sortByKey(new TupleComparator(), true);
sortedPairRDD.foreach(rdd -> {
System.out.println("sort = " + rdd);
});
// stop
jsc.stop();
jsc.close();
}
}
Here is the TupleComparator class, which implements both the Comparator and Serializable interfaces.
class TupleComparator implements Comparator<Tuple2<Integer, Integer>>, Serializable {
@Override
public int compare(Tuple2<Integer, Integer> o1, Tuple2<Integer, Integer> o2) {
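// Compare by the first element and fall back to the second on a tie.
// Integer.compare avoids the reference comparison that == performs on boxed
// Integers and the overflow that subtraction can cause.
int first = Integer.compare(o1._1(), o2._1());
return first != 0 ? first : Integer.compare(o1._2(), o2._2());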
}
}
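If you want to check that a comparator really is serializable before handing it to sortByKey, you can round-trip it through plain Java serialization. The sketch below is only an illustration and makes some assumptions: Map.Entry is used as a stand-in for Tuple2 so that it runs without Spark on the classpath, and PairComparator, roundTrip and sortedSample are hypothetical names that are not part of Spark. It also assumes that surviving a Java serialization round trip is a reasonable proxy for what Spark needs when it ships the comparator to executors.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SerializableComparatorSketch {

    // Hypothetical stand-in for TupleComparator: Map.Entry plays the role of
    // Tuple2 (key = first element, value = second element).
    public static class PairComparator
            implements Comparator<Map.Entry<Integer, Integer>>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public int compare(Map.Entry<Integer, Integer> a, Map.Entry<Integer, Integer> b) {
            // Order by the first element, then by the second on a tie.
            int first = Integer.compare(a.getKey(), b.getKey());
            return first != 0 ? first : Integer.compare(a.getValue(), b.getValue());
        }
    }

    // Writes the object with Java serialization and reads it back.
    // Throws java.io.NotSerializableException if the object cannot be serialized.
    @SuppressWarnings("unchecked")
    public static <T> T roundTrip(T obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    // Sorts the same four keys as the sample above with a comparator that has
    // been through a serialization round trip.
    public static List<Map.Entry<Integer, Integer>> sortedSample()
            throws IOException, ClassNotFoundException {
        List<Map.Entry<Integer, Integer>> keys = new ArrayList<>();
        keys.add(new SimpleEntry<>(2, 444));
        keys.add(new SimpleEntry<>(3, 333));
        keys.add(new SimpleEntry<>(1, 111));
        keys.add(new SimpleEntry<>(2, 222));

        Comparator<Map.Entry<Integer, Integer>> cmp = roundTrip(new PairComparator());
        keys.sort(cmp);
        return keys;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sortedSample()); // prints [1=111, 2=222, 2=444, 3=333]
    }
}
```

If roundTrip throws NotSerializableException for your own comparator, that is a hint that it (or something it captures, such as an enclosing non-serializable class) would probably fail in Spark as well.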