top() does not work on a JavaPairRDD in Apache Spark

Date: 2015-06-23 14:34:40

Tags: java apache-spark

I have extended the existing Java WordCount example explained on the official Apache Spark website. The extensions are:

  1. Sort the tuples by their number of occurrences. For example, the existing unsorted order: (nec,8), (eu,11), (sit,7). The sorted order I want: (eu,11), (nec,8), (sit,7).
  2. Take the top 3 tuples from that sorted list.

The sorting works fine, but top() does not work on the JavaPairRDD. Let me paste my code; the other methods are unchanged, so here is my main method:
    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Please provide the input file full path as argument");
            System.exit(0);
        }

        SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
        JavaSparkContext context = new JavaSparkContext(conf);
        JavaRDD<String> file = context.textFile(args[0]);
        JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);

        /* Pairs with key = word and value = no. of occurrences */
        JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
        JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);

        // First swap (key = no. of occurrences, value = word) to allow sorting by no. of occurrences
        JavaPairRDD<Integer, String> swappedPair = counter.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            public Tuple2<Integer, String> call(Tuple2<String, Integer> item) throws Exception {
                return item.swap();
            }
        });

        // After swapping, the tuples are sorted by no. of occurrences
        JavaPairRDD<Integer, String> sortedCounter = swappedPair.sortByKey(false);

        // Reverse the swap
        JavaPairRDD<String, Integer> reverseSwappedPair = sortedCounter.mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
            public Tuple2<String, Integer> call(Tuple2<Integer, String> item) throws Exception {
                return item.swap();
            }
        });

        reverseSwappedPair.top(3);   // <-- this line throws the exception below
        reverseSwappedPair.saveAsTextFile(args[1]);
    }
    

Without the marked line, the remaining code runs fine and gives the correct result, i.e., the tuples in sorted order by the words' number of occurrences. I added that line to get the top 3 sorted tuples, but it throws the exception shown below. I also tried the JavaRDD alternative,

    JavaRDD<Tuple2<String, Integer>> co = JavaRDD.fromRDD(JavaPairRDD.toRDD(reverseSwappedPair), reverseSwappedPair.classTag());
    co.top(3);

but it gives the same exception. Please help me solve this; I have tried other options with no result.

Exception:

    15/06/23 07:21:28 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
    java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Comparable
        at com.google.common.collect.NaturalOrdering.compare(NaturalOrdering.java:26)
        at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
        at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
        at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
        at com.google.common.collect.Ordering.max(Ordering.java:572)
        at com.google.common.collect.Ordering.leastOf(Ordering.java:688)
        at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1334)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1331)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    15/06/23 07:21:28 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Comparable
        at com.google.common.collect.NaturalOrdering.compare(NaturalOrdering.java:26)
        at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
        at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
        at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
        at com.google.common.collect.Ordering.max(Ordering.java:572)
        at com.google.common.collect.Ordering.leastOf(Ordering.java:688)
        at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1334)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1331)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    

1 Answer:

Answer 0 (score: 1):

I think you can use the other top() API: java.util.List<T> top(int num, java.util.Comparator<T> comp)

top() cannot compare two tuples directly (as the ClassCastException shows, scala.Tuple2 does not implement java.lang.Comparable), so write a custom comparator. Hope this helps.
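
A minimal sketch of what that could look like, assuming the Spark 1.x Java API used in the question (the class name CountComparator is illustrative, not from the original code). The comparator orders the (word, count) tuples by their count and implements Serializable, since Spark ships it to the executors:

    import java.io.Serializable;
    import java.util.Comparator;
    import java.util.List;

    import scala.Tuple2;

    // Orders (word, count) tuples by the count (the tuple's second field).
    // Serializable is required because Spark sends the comparator to the
    // executors; a plain anonymous Comparator is typically not serializable.
    class CountComparator implements Comparator<Tuple2<String, Integer>>, Serializable {
        public int compare(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
            return Integer.compare(a._2(), b._2());
        }
    }

    // Usage: returns the 3 tuples with the largest counts, in descending order.
    List<Tuple2<String, Integer>> top3 = reverseSwappedPair.top(3, new CountComparator());

Note that with such a comparator the swap/sort/swap steps are not needed for the top 3 itself, since top() does its own ordering; the sortByKey pass is only required if you also want to save the fully sorted output.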