Spark - Word count with sorting (the sort does not work)

Time: 2015-04-19 08:27:56

Tags: java apache-spark

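I am learning Spark and trying to extend the WordCount example so that the words are sorted by their number of occurrences. The problem is that after running the code, the results I get are not sorted: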

(708,word1)
(46,word2)
(65,word3)

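So, for some reason, the sort seems to fail. I see a similar effect when I use the wordSortedByCount.first() command, and when I limit execution to a single thread.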

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

public class JavaWordCount2 {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCountAndSort");
        int numOfKernels = 8;
        sparkConf.setMaster("local[" + numOfKernels + "]");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        JavaRDD<String> lines = ctx.textFile("data.csv", 1);
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line
                .split("[,; :\\.]")));
        words = words.flatMap(line -> Arrays.asList(line.replaceAll("[\"\\(\\)]", "").toLowerCase()));

        // sum words
        JavaPairRDD<String, Integer> counts = words.mapToPair(
                w -> new Tuple2<String, Integer>(w, 1)).reduceByKey(
                (x, y) -> x + y);

        // keep only words that occur more than 5 times
        // counts = counts.filter(s -> s._2 > 5);
        counts = counts.filter(new Function<Tuple2<String,Integer>, Boolean>() {
            @Override
            public Boolean call(Tuple2<String, Integer> v1) throws Exception {
                return v1._2 > 5;
            }
        });

        // to enable sorting by value (count) and not key -> value-to-key conversion pattern
        // setting value to null, since it won't be used anymore
        JavaPairRDD<Tuple2<Integer, String>, Integer> countInKey = counts.mapToPair(a -> new Tuple2(new Tuple2<Integer, String>(a._2, a._1), null));

        // sort by number of occurrences
        JavaPairRDD<Tuple2<Integer, String>, Integer> wordSortedByCount = countInKey.sortByKey(new TupleComparator(), true);

        // print result    
        List<Tuple2<Tuple2<Integer, String>, Integer>> output = wordSortedByCount.take(10);
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1());
        }
        ctx.stop();
    }
}

The comparator class:

import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;
public class TupleComparator implements Comparator<Tuple2<Integer, String>>,
        Serializable {
    @Override
    public int compare(Tuple2<Integer, String> tuple1,
            Tuple2<Integer, String> tuple2) {
        return tuple1._1 < tuple2._1 ? 0 : 1;
    }
}

Can anyone point out what is wrong with my code?

1 answer:

Answer 0 (score: 3)

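The first problem with your code is in the comparator. You return either 0 or 1, whereas the compare method should return a negative value when the first element comes before the second, zero when the two are equal, and a positive value when the first comes after the second. So change it to: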

@Override
public int compare(Tuple2<Integer, String> tuple1,
        Tuple2<Integer, String> tuple2) {
    return tuple1._1 - tuple2._1;
}
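
To make the comparator contract concrete, here is a minimal plain-Java sketch that runs without Spark. It is an illustration under stated assumptions, not part of the original answer: `Map.Entry<Integer, String>` stands in for `Tuple2<Integer, String>`, and the class name `ComparatorContractDemo` and the fields `BROKEN` and `FIXED` are hypothetical. It uses `Integer.compare` instead of subtraction; subtraction is fine for non-negative word counts, but it can overflow for arbitrary ints, so `Integer.compare` is probably the safer habit.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Comparator;
import java.util.Map;

// Hypothetical demo class; Map.Entry stands in for scala.Tuple2 so no Spark is needed.
class ComparatorContractDemo {
    // Mirrors the comparator from the question: it can only return 0 or 1.
    static final Comparator<Map.Entry<Integer, String>> BROKEN =
            (a, b) -> a.getKey() < b.getKey() ? 0 : 1;

    // Follows the contract: negative, zero, or positive.
    static final Comparator<Map.Entry<Integer, String>> FIXED =
            (a, b) -> Integer.compare(a.getKey(), b.getKey());

    public static void main(String[] args) {
        Map.Entry<Integer, String> low = new SimpleEntry<>(46, "word2");
        Map.Entry<Integer, String> high = new SimpleEntry<>(708, "word1");

        // The broken comparator reports "equal" for low vs. high,
        // and reports an element as greater than itself.
        System.out.println(BROKEN.compare(low, high)); // prints 0
        System.out.println(BROKEN.compare(low, low));  // prints 1

        // The fixed comparator distinguishes all three cases.
        System.out.println(FIXED.compare(low, high));  // prints -1
        System.out.println(FIXED.compare(low, low));   // prints 0
        System.out.println(FIXED.compare(high, low));  // prints 1

        // Why subtraction is risky in general: this wraps around to a positive
        // number, which would wrongly claim that Integer.MIN_VALUE > 1.
        System.out.println(Integer.MIN_VALUE - 1 > 0); // prints true
    }
}
```

Because the broken version never returns a negative number, a sort has no way to learn that one element belongs before another, which is consistent with the unsorted output shown in the question.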

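In addition, you should set the second parameter of sortByKey to false; otherwise you will get ascending order, i.e. from lowest to highest, which is the exact opposite of what you want.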
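
The effect of the ascending flag can be imitated in plain Java, again without Spark. This is only a sketch under assumptions: `Map.Entry<Integer, String>` stands in for `Tuple2<Integer, String>`, and the class `DescendingCountDemo` and the helper `takeSorted` are hypothetical names whose `ascending` parameter plays the role of the second argument of `sortByKey`. The sample data are the three pairs from the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; it imitates sortByKey(comparator, ascending) followed by take(n).
class DescendingCountDemo {
    // Ascending order on the count, following the Comparator contract.
    static final Comparator<Map.Entry<Integer, String>> ASCENDING =
            (a, b) -> Integer.compare(a.getKey(), b.getKey());

    // Sorts a copy of the entries and returns at most n of them.
    // 'ascending == false' reverses the comparator, giving highest counts first.
    static List<Map.Entry<Integer, String>> takeSorted(
            List<Map.Entry<Integer, String>> entries, boolean ascending, int n) {
        List<Map.Entry<Integer, String>> copy = new ArrayList<>(entries);
        copy.sort(ascending ? ASCENDING : ASCENDING.reversed());
        return new ArrayList<>(copy.subList(0, Math.min(n, copy.size())));
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> counts = new ArrayList<>();
        counts.add(new SimpleEntry<>(708, "word1"));
        counts.add(new SimpleEntry<>(46, "word2"));
        counts.add(new SimpleEntry<>(65, "word3"));

        // Ascending: lowest count first.
        System.out.println(takeSorted(counts, true, 10));
        // prints [46=word2, 65=word3, 708=word1]

        // Descending: most frequent word first, which is what the question wants.
        System.out.println(takeSorted(counts, false, 10));
        // prints [708=word1, 65=word3, 46=word2]
    }
}
```

In the Spark program itself, the corresponding change would be to pass false as the second argument in the call countInKey.sortByKey(new TupleComparator(), true), once the comparator has been corrected.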