Find the line number of the longest line using Spark Java

Time: 2016-09-07 10:26:08

Tags: java apache-spark

I am facing a problem where I have to find the longest line and its index. This is my approach:

    SparkConf conf = new SparkConf().setMaster("local").setAppName("basicavg");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> rdd = sc.textFile("/home/impadmin/ravi.txt");
    JavaRDD<Tuple2<Integer,String>> words = rdd.map(new Function<String, Tuple2<Integer,String>>() {

        @Override
        public Tuple2<Integer,String> call(String v1) throws Exception {
            // TODO Auto-generated method stub
            return new Tuple2<Integer, String>(v1.split(" ").length, v1);
        }
    });
    JavaPairRDD<Integer, String> linNoToWord = JavaPairRDD.fromJavaRDD(words).sortByKey(false);

    System.out.println(linNoToWord.first()._1+"  *********************  "+linNoToWord.first()._2);

2 Answers:

Answer 0 (score: 1)

This way, the RDD of tuples is sorted by key, and the first element of the new sorted RDD is the line with the greatest length:

    JavaRDD<String> rdd = sc.textFile("/home/impadmin/ravi.txt");
    // Pair each line with its word count (number of space-separated tokens).
    JavaRDD<Tuple2<Integer,String>> words = rdd.map(new Function<String, Tuple2<Integer,String>>() {
        @Override
        public Tuple2<Integer,String> call(String v1) throws Exception {
            return new Tuple2<Integer, String>(v1.split(" ").length, v1);
        }
    });
    // Sort by word count in descending order, into a single partition.
    JavaRDD<Tuple2<Integer,String>> tupleRDD1 = words.sortBy(new Function<Tuple2<Integer,String>, Integer>() {
        @Override
        public Integer call(Tuple2<Integer, String> v1) throws Exception {
            return v1._1;
        }
    }, false, 1);
    // The first element is now the line with the most words.
    System.out.println(tupleRDD1.first());
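Note that the sort key in this answer is `v1.split(" ").length`, that is, the number of space-separated tokens, not the number of characters. The following is a minimal plain-Java sketch (no Spark involved) of what that key evaluates to and of what "sort descending, take the first" returns on a small in-memory list. The class name `WordCountKey` and the sample lines are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordCountKey {
    // Same key as in the answer: number of tokens after splitting on a single space.
    // Consecutive spaces yield empty tokens, which are counted too.
    static int key(String line) {
        return line.split(" ").length;
    }

    // Local stand-in for "sort by key descending, then take the first element".
    static String longest(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> Integer.compare(key(b), key(a)));
        return sorted.get(0);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b", "one two three four", "x y z");
        System.out.println(key(lines.get(1)) + " " + longest(lines));
    }
}
```

If lines may contain repeated spaces or tabs, splitting on a regular expression such as `"\\s+"` may give a count closer to what is intended; whether that matters depends on the input file.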

Answer 1 (score: 0)

Since you care about both the line number and the text, try this.

First, create a serializable class:

public static class Line implements Serializable {
    public Line(Long lineNo, String text) {
        lineNo_ = lineNo;
        text_ = text;
    }
    public Long lineNo_;
    public String text_;
}

Then do the following:

    SparkConf conf = new SparkConf().setMaster("local[1]").setAppName("basicavg");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> rdd = sc.textFile("/home/impadmin/words.txt");
    JavaPairRDD<Long, Line> linNoToWord2 = rdd.zipWithIndex().mapToPair(new PairFunction<Tuple2<String,Long>, Long, Line>() {
        public Tuple2<Long, Line> call(Tuple2<String, Long> t){

            return new Tuple2<Long, Line>(Long.valueOf(t._1.split(" ").length), new Line(t._2, t._1));
        }
    }).sortByKey(false);

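    // Fetch the top element once; it holds the word count, the line number and the text.
    Tuple2<Long, Line> longest = linNoToWord2.first();
    System.out.println(longest._1+"  *********************  "+longest._2.lineNo_+"  *********************  "+longest._2.text_);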
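To check what this job should report on a small sample, the same logic can be mimicked locally without Spark. The sketch below is an assumption-laden stand-in, not the answer's code: `LongestLineIndex`, `Result` and `find` are hypothetical names, the index is 0-based (as with `zipWithIndex`, which starts at 0), and a single pass replaces the sort. Ties go to the earliest line, which is a choice made in this sketch only.

```java
import java.util.Arrays;
import java.util.List;

class LongestLineIndex {
    // Hypothetical result holder: index, word count and text of one line.
    static final class Result {
        final long lineNo;
        final int words;
        final String text;

        Result(long lineNo, int words, String text) {
            this.lineNo = lineNo;
            this.words = words;
            this.text = text;
        }
    }

    // Single pass over the lines, keeping the one with the most
    // space-separated tokens. Returns null for an empty list.
    static Result find(List<String> lines) {
        Result best = null;
        for (int i = 0; i < lines.size(); i++) {
            String text = lines.get(i);
            int words = text.split(" ").length;
            // Strict '>' keeps the earliest line when counts are tied.
            if (best == null || words > best.words) {
                best = new Result(i, words, text);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b", "one two three four", "x y z");
        Result r = find(lines);
        System.out.println(r.words + " words at index " + r.lineNo + ": " + r.text);
    }
}
```

If a 1-based line number is wanted, add 1 to the index when printing.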