Sort by key in an Apache Spark JavaPairRDD

Time: 2017-08-06 09:33:52

Tags: java sorting apache-spark rdd

I have a JavaPairRDD whose key type is Tuple2<Integer, Integer>.

I want to sort the JavaPairRDD by its key, so I wrote a Comparator like this:

JavaPairRDD<Tuple2<Integer, Integer>, Integer> Rresult=result.sortByKey(new Comparator<Tuple2<Integer, Integer>>() {
     @Override
     public int compare(Tuple2<Integer, Integer> o1, Tuple2<Integer, Integer> o2) {
         if(o1._1()==o2._1())
             return o1._2()-o2._2();
         return o1._1()-o2._1();
       }
},true);

It sorts the keys by the first entry in the tuple and, if those are equal, by the second entry.

But I get the following error stack:

java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

.. scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1083)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at java.io.ObjectStrea

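1 answer:

Answer 0: (score: 2)

How did you create the JavaPairRDD? Please check that before applying the sort. Also, if you pass a new anonymous Comparator directly to sortByKey, you will get a "Task not serializable" exception. You should implement both Comparator and Serializable in a separate class and pass an instance of that class to sortByKey. Here is a sample for your reference.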

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class SparkSortSample {
    public static void main(String[] args) {
        // SparkSession running locally on a single thread
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSortSample")
                .master("local[1]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        // Sample data: ((first, second), value)
        List<Tuple2<Tuple2<Integer, Integer>, Integer>> inputList = new ArrayList<>();
        inputList.add(new Tuple2<>(new Tuple2<>(2, 444), 4444));
        inputList.add(new Tuple2<>(new Tuple2<>(3, 333), 3333));
        inputList.add(new Tuple2<>(new Tuple2<>(1, 111), 1111));
        inputList.add(new Tuple2<>(new Tuple2<>(2, 222), 2222));
        // JavaPairRDD keyed by the inner tuple
        JavaPairRDD<Tuple2<Integer, Integer>, Integer> javaPairRdd = jsc.parallelizePairs(inputList);
        // Sorted RDD: ascending order, using the serializable comparator
        JavaPairRDD<Tuple2<Integer, Integer>, Integer> sortedPairRDD = javaPairRdd.sortByKey(new TupleComparator(), true);
        // Each element is one (key, value) pair
        sortedPairRDD.foreach(pair -> {
            System.out.println("sort = " + pair);
        });
        // Stop the context
        jsc.stop();
    }
}
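To see why the comparator has to be Serializable, here is a small sketch that runs without Spark. It assumes the comparator ends up going through standard Java serialization (which, as far as I know, is what Spark does by default when it ships closures to tasks). The names SerializableCheck, canSerialize and SerializableIntComparator are made up for illustration, and the comparators work on plain Integer instead of Tuple2 to keep it self-contained. It also shows an alternative to a separate class: casting a lambda to an intersection type.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Comparator;

class SerializableCheck {
    // Hypothetical named comparator that is also Serializable,
    // playing the role of TupleComparator in this sketch.
    static class SerializableIntComparator implements Comparator<Integer>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public int compare(Integer a, Integer b) {
            return Integer.compare(a, b);
        }
    }

    // Anonymous comparator: Comparator alone does not extend Serializable.
    static Comparator<Integer> anonymousComparator() {
        return new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return Integer.compare(a, b);
            }
        };
    }

    // Lambda cast to an intersection type so that it is Serializable too.
    static Comparator<Integer> serializableLambda() {
        return (Comparator<Integer> & Serializable) (a, b) -> Integer.compare(a, b);
    }

    // Returns true if Java serialization accepts the object.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("anonymous:  " + canSerialize(anonymousComparator()));
        System.out.println("named:      " + canSerialize(new SerializableIntComparator()));
        System.out.println("lambda:     " + canSerialize(serializableLambda()));
    }
}
```

Only the anonymous comparator is rejected here. Whether you prefer the named class or the lambda cast is mostly a matter of taste; the named class is easier to reuse and test.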

Here is the TupleComparator class, which implements both the Comparator and Serializable interfaces.

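import java.io.Serializable;
import java.util.Comparator;

import scala.Tuple2;

class TupleComparator implements Comparator<Tuple2<Integer, Integer>>, Serializable {
    @Override
    public int compare(Tuple2<Integer, Integer> o1, Tuple2<Integer, Integer> o2) {
        // Compare by value: == on boxed Integers compares references,
        // and subtraction can overflow for large values.
        int byFirst = Integer.compare(o1._1(), o2._1());
        if (byFirst != 0) {
            return byFirst;
        }
        return Integer.compare(o1._2(), o2._2());
    }
}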
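One more thing to watch for in a comparator like this: using == on two Integer objects compares references, not values, and returning a difference such as a - b can overflow for values far apart. Below is a minimal sketch of the same "first entry, then second entry" ordering written with Integer.compare. It runs without Spark; IntPair is a hypothetical stand-in for Tuple2<Integer, Integer>, and PairKeyOrdering and bySubtraction are names I made up for the example.

```java
import java.util.Comparator;

class PairKeyOrdering {
    // Hypothetical stand-in for scala.Tuple2<Integer, Integer>.
    static final class IntPair {
        final Integer first;
        final Integer second;

        IntPair(Integer first, Integer second) {
            this.first = first;
            this.second = second;
        }
    }

    // Order by the first entry, then by the second entry on ties.
    // Integer.compare unboxes its arguments and never overflows.
    static final Comparator<IntPair> BY_FIRST_THEN_SECOND = (a, b) -> {
        int byFirst = Integer.compare(a.first, b.first);
        return byFirst != 0 ? byFirst : Integer.compare(a.second, b.second);
    };

    // Subtraction-based comparison, shown only for contrast.
    static int bySubtraction(int x, int y) {
        return x - y;
    }

    public static void main(String[] args) {
        // Equal first entries outside the small-Integer cache range:
        // the tie is still broken by the second entry.
        int tie = BY_FIRST_THEN_SECOND.compare(new IntPair(1000, 1), new IntPair(1000, 2));
        System.out.println("tie broken by second entry: " + (tie < 0));

        // Integer.MIN_VALUE is smaller than 1, but the difference wraps around.
        System.out.println("Integer.compare: " + Integer.compare(Integer.MIN_VALUE, 1));
        System.out.println("subtraction:     " + bySubtraction(Integer.MIN_VALUE, 1));
    }
}
```

With small keys like the ones in the sample data, the subtraction form happens to give the right order, so the problem only shows up once the values get large.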