I have extended the existing Java WordCount example explained on the Apache Spark official website. The extension is:
public static void main(String[] args) {
    if (args.length < 1) {
        System.err.println("Please provide the input file full path as argument");
        System.exit(0);
    }

    SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
    JavaSparkContext context = new JavaSparkContext(conf);

    JavaRDD<String> file = context.textFile(args[0]);
    JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);
    // Pairs with key = word and value = number of occurrences
    JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
    JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);

    // First swap (key = number of occurrences, value = word) to allow sorting by number of occurrences
    JavaPairRDD<Integer, String> swappedPair = counter.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
        public Tuple2<Integer, String> call(Tuple2<String, Integer> item) throws Exception {
            return item.swap();
        }
    });

    // After swapping, the tuples are sorted by number of occurrences
    JavaPairRDD<Integer, String> sortedCounter = swappedPair.sortByKey(false);

    // Reverse the swap
    JavaPairRDD<String, Integer> reverseSwappedPair = sortedCounter.mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
        public Tuple2<String, Integer> call(Tuple2<Integer, String> item) throws Exception {
            return item.swap();
        }
    });

    reverseSwappedPair.top(3); // <-- the problematic line
    reverseSwappedPair.saveAsTextFile(args[1]);
}
}
Without the line marked as problematic above (reverseSwappedPair.top(3)), the remaining code runs fine and gives the correct result, i.e. the tuples end up sorted by the word's number of occurrences. I added that line to get the top 3 sorted tuples, but it throws the exception shown below. I also tried another JavaRDD option,
JavaRDD<Tuple2<String, Integer>> co = JavaRDD.fromRDD(JavaPairRDD.toRDD(reverseSwappedPair), reverseSwappedPair.classTag());
co.top(3);
but it throws the same exception as well. Please help me solve this problem. I have tried other options, but without result.
Exception:
15/06/23 07:21:28 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Comparable
at com.google.common.collect.NaturalOrdering.compare(NaturalOrdering.java:26)
at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
at com.google.common.collect.Ordering.max(Ordering.java:572)
at com.google.common.collect.Ordering.leastOf(Ordering.java:688)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1331)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/23 07:21:28 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Comparable
... (same stack trace as above)
Answer (score: 1)
I think you can use the other overload of the API: java.util.List<T> top(int num, java.util.Comparator<T> comp).
Spark cannot compare two tuples directly with the natural ordering, because scala.Tuple2 does not implement java.lang.Comparable; write a custom comparator. Hope this helps.
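A minimal sketch of such a comparator, assuming the reverseSwappedPair variable from the question; the class also implements Serializable, since Spark ships the comparator to the executors:

import java.io.Serializable;
import java.util.Comparator;
import java.util.List;

import scala.Tuple2;

// Orders (word, count) tuples by their count so top() can rank them.
class TupleComparator implements Comparator<Tuple2<String, Integer>>, Serializable {
    @Override
    public int compare(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
        return Integer.compare(a._2(), b._2());
    }
}

// Usage: returns the 3 tuples with the highest counts.
List<Tuple2<String, Integer>> top3 = reverseSwappedPair.top(3, new TupleComparator());

Note that top() applies the comparator itself, so if you only need the top 3 tuples, the swap/sort/swap steps are not strictly necessary: calling top(3, new TupleComparator()) directly on the unsorted counter RDD would work as well.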