Simple calculation using Apache Spark

Date: 2015-07-22 17:14:16

Tags: apache-spark

I have a JavaPairRDD (String, Tuple2) resulting from a join operation. Here are the data details - [Userid, [(name, rating)]]

Output: [(user2,[(John,5)]), (user3,[(Mac,3), (Mac,2)]), (user1,[(Phil,3), (Phil,4)])]

I want to compute the minimum, maximum, and average rating for each user. I am not sure which transformation/action would help me.
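To pin down the target values, here is a minimal plain-Java sketch of the same per-user fold, outside Spark (the class and method names are illustrative, not from the post): each user accumulates [min, max, sum, count], and the average is sum/count.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerUserStats {
    // Fold (user, rating) pairs into per-user [min, max, sum, count]
    static Map<String, int[]> aggregate(List<Map.Entry<String, Integer>> ratings) {
        Map<String, int[]> acc = new HashMap<>();
        for (Map.Entry<String, Integer> e : ratings) {
            int r = e.getValue();
            acc.merge(e.getKey(), new int[]{r, r, r, 1}, (a, b) -> new int[]{
                    Math.min(a[0], b[0]), Math.max(a[1], b[1]), a[2] + b[2], a[3] + b[3]});
        }
        return acc;
    }

    public static void main(String[] args) {
        // Same sample data as the question: (user, rating) pairs
        List<Map.Entry<String, Integer>> ratings = Arrays.asList(
                Map.entry("user2", 5), Map.entry("user3", 3), Map.entry("user3", 2),
                Map.entry("user1", 3), Map.entry("user1", 4));
        for (Map.Entry<String, int[]> e : aggregate(ratings).entrySet()) {
            int[] a = e.getValue();
            System.out.println(e.getKey() + ": min=" + a[0] + ", max=" + a[1]
                    + ", avg=" + (double) a[2] / a[3]);
        }
    }
}
```

For the sample data this yields min=3, max=4, avg=3.5 for user1; min=5, max=5, avg=5.0 for user2; and min=2, max=3, avg=2.5 for user3 - the same shape of fold the Spark answer below expresses with aggregateByKey.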

1 answer:

Answer 0 (score: 0)

There are multiple ways to achieve this, but the simplest is to use the aggregateByKey method.

Here is an example showing how to do it:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;
import scala.Tuple3;
import scala.Tuple4;

public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Sample input: "userId:name" and "userId:rating" strings
    String[] values = {"user2:John", "user1:Phil", "user3:Mac"};
    String[] ratingValues = {"user2:5", "user3:3", "user3:2", "user1:3", "user1:4"};

    JavaRDD<String> users = sc.parallelize(Arrays.asList(values));
    JavaRDD<String> ratings = sc.parallelize(Arrays.asList(ratingValues));

    // Parse "userId:name" into (userId, name) pairs
    JavaPairRDD<String, String> usersPair = users.mapToPair(
            new PairFunction<String, String, String>() {
                public Tuple2<String, String> call(String s) throws Exception {
                    String[] splits = s.split(":");
                    return new Tuple2<String, String>(splits[0], splits[1]);
                }
            }
    );

    // Parse "userId:rating" into (userId, rating) pairs
    JavaPairRDD<String, Integer> ratingsPair = ratings.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) throws Exception {
                    String[] splits = s.split(":");
                    return new Tuple2<String, Integer>(splits[0], Integer.parseInt(splits[1]));
                }
            }
    );

    // Join on userId, giving (userId, (name, rating))
    JavaPairRDD<String, Tuple2<String, Integer>> joined = usersPair.join(ratingsPair);

    // Aggregate per key into a (max, min, sum, count) accumulator.
    // The zero value starts max at Integer.MIN_VALUE and min at Integer.MAX_VALUE
    // so the first rating always replaces them.
    JavaPairRDD<String, Tuple4<Integer, Integer, Integer, Integer>> aggregate = joined.aggregateByKey(new Tuple4<Integer, Integer, Integer, Integer>(Integer.MIN_VALUE, Integer.MAX_VALUE, 0, 0),
            // seqFunc: fold one (name, rating) value into the accumulator
            new Function2<Tuple4<Integer, Integer, Integer, Integer>, Tuple2<String, Integer>, Tuple4<Integer, Integer, Integer, Integer>>() {
        public Tuple4<Integer, Integer, Integer, Integer> call(Tuple4<Integer, Integer, Integer, Integer> a, Tuple2<String, Integer> b) throws Exception {
            return new Tuple4<Integer, Integer, Integer, Integer>(Math.max(a._1(), b._2()), Math.min(a._2(), b._2()), a._3() + b._2(), a._4() + 1);
        }
    },
            // combFunc: merge two partial accumulators from different partitions
            new Function2<Tuple4<Integer, Integer, Integer, Integer>, Tuple4<Integer, Integer, Integer, Integer>, Tuple4<Integer, Integer, Integer, Integer>>() {
        public Tuple4<Integer, Integer, Integer, Integer> call(Tuple4<Integer, Integer, Integer, Integer> a, Tuple4<Integer, Integer, Integer, Integer> b) throws Exception {
            return new Tuple4<Integer, Integer, Integer, Integer>(Math.max(a._1(), b._1()), Math.min(a._2(), b._2()), a._3() + b._3(), a._4() + b._4());
        }
    });

    // Replace (sum, count) with the mean, giving (userId, (max, min, mean))
    JavaRDD<Tuple2<String, Tuple3<Integer, Integer, Double>>> aggregateWithMean = aggregate.map(
            new Function<Tuple2<String,Tuple4<Integer,Integer,Integer,Integer>>, Tuple2<String, Tuple3<Integer, Integer, Double>>>() {
                public Tuple2<String, Tuple3<Integer, Integer, Double>> call(Tuple2<String, Tuple4<Integer, Integer, Integer, Integer>> a) throws Exception {
                    Tuple3<Integer, Integer, Double> mean = new Tuple3<Integer, Integer, Double>(a._2()._1(), a._2()._2(), a._2()._3().doubleValue() / a._2()._4());
                    return new Tuple2<String, Tuple3<Integer, Integer, Double>>(a._1(), mean);
                }
            }
    );

    aggregateWithMean.foreach(new VoidFunction<Tuple2<String, Tuple3<Integer, Integer, Double>>>() {
        public void call(Tuple2<String, Tuple3<Integer, Integer, Double>> stringTuple3Tuple2) throws Exception {
            System.out.println(stringTuple3Tuple2);
        }
    });

    sc.stop();
}

With the Java API this becomes quite messy because of all the generic types. I would suggest using the Scala API instead - it looks much nicer :-)