I have a JavaPairRDD of (String, Tuple2) that is the result of a join operation.
The data has the following shape: [Userid, [(name, rating)]]
Output: [(user2,[(John,5)]), (user3,[(Mac,3), (Mac,2)]), (user1,[(Phil,3), (Phil,4)])]
I want to compute the minimum, maximum and average rating for each user. I am not sure which transformation or action can help me.
Answer 0 (score: 0)
There are several ways to achieve this, but the simplest is to use the aggregateByKey method.
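As a rough illustration of what aggregateByKey does for each key, here is a plain-Java sketch with no Spark dependency. The class `AggregateByKeySketch`, the `Stats` accumulator and the helper `foldPartition` are hypothetical names introduced only for this sketch, and the two lists stand in for two partitions that each hold part of user1's ratings.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; no Spark classes are used here.
final class AggregateByKeySketch {

    // Hypothetical accumulator holding (max, min, sum, count) for one key.
    static final class Stats {
        final int max;
        final int min;
        final int sum;
        final int count;

        Stats(int max, int min, int sum, int count) {
            this.max = max;
            this.min = min;
            this.sum = sum;
            this.count = count;
        }

        double mean() {
            return (double) sum / count;
        }
    }

    // Zero value: neutral for max, min, sum and count.
    static final Stats ZERO = new Stats(Integer.MIN_VALUE, Integer.MAX_VALUE, 0, 0);

    // Plays the role of the first function (seqOp): fold one rating into the accumulator.
    static Stats seqOp(Stats acc, int rating) {
        return new Stats(Math.max(acc.max, rating), Math.min(acc.min, rating),
                acc.sum + rating, acc.count + 1);
    }

    // Plays the role of the second function (combOp): merge two partial accumulators.
    static Stats combOp(Stats a, Stats b) {
        return new Stats(Math.max(a.max, b.max), Math.min(a.min, b.min),
                a.sum + b.sum, a.count + b.count);
    }

    // Fold all ratings of one simulated partition, starting from the zero value.
    static Stats foldPartition(List<Integer> ratings) {
        Stats acc = ZERO;
        for (int rating : ratings) {
            acc = seqOp(acc, rating);
        }
        return acc;
    }

    public static void main(String[] args) {
        // user1's ratings (3 and 4), assumed to sit in two different partitions.
        Stats part1 = foldPartition(Arrays.asList(3));
        Stats part2 = foldPartition(Arrays.asList(4));
        Stats user1 = combOp(part1, part2);
        System.out.println("max=" + user1.max + " min=" + user1.min + " mean=" + user1.mean());
    }
}
```

Running it prints `max=4 min=3 mean=3.5`. The zero value may be applied once per partition, so every field has to be neutral for its operation (Integer.MIN_VALUE for max, Integer.MAX_VALUE for min, 0 for sum and count).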
Here is a complete Spark example that shows how to do it:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;
import scala.Tuple3;
import scala.Tuple4;

// The class name is arbitrary; the original answer only showed the main method.
public class RatingStats {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Sample input in "userId:value" form.
        String[] values = {"user2:John", "user1:Phil", "user3:Mac"};
        String[] ratingValues = {"user2:5", "user3:3", "user3:2", "user1:3", "user1:4"};
        JavaRDD<String> users = sc.parallelize(Arrays.asList(values));
        JavaRDD<String> ratings = sc.parallelize(Arrays.asList(ratingValues));

        // (userId, name)
        JavaPairRDD<String, String> usersPair = users.mapToPair(
            new PairFunction<String, String, String>() {
                public Tuple2<String, String> call(String s) throws Exception {
                    String[] splits = s.split(":");
                    return new Tuple2<String, String>(splits[0], splits[1]);
                }
            }
        );

        // (userId, rating)
        JavaPairRDD<String, Integer> ratingsPair = ratings.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) throws Exception {
                    String[] splits = s.split(":");
                    return new Tuple2<String, Integer>(splits[0], Integer.parseInt(splits[1]));
                }
            }
        );

        // (userId, (name, rating)): the shape described in the question.
        JavaPairRDD<String, Tuple2<String, Integer>> joined = usersPair.join(ratingsPair);

        // Accumulator per user is (max, min, sum, count); the zero value is neutral for each field.
        JavaPairRDD<String, Tuple4<Integer, Integer, Integer, Integer>> aggregate = joined.aggregateByKey(
            new Tuple4<Integer, Integer, Integer, Integer>(Integer.MIN_VALUE, Integer.MAX_VALUE, 0, 0),
            // Fold one (name, rating) value into the accumulator.
            new Function2<Tuple4<Integer, Integer, Integer, Integer>, Tuple2<String, Integer>, Tuple4<Integer, Integer, Integer, Integer>>() {
                public Tuple4<Integer, Integer, Integer, Integer> call(Tuple4<Integer, Integer, Integer, Integer> a, Tuple2<String, Integer> b) throws Exception {
                    return new Tuple4<Integer, Integer, Integer, Integer>(Math.max(a._1(), b._2()), Math.min(a._2(), b._2()), a._3() + b._2(), a._4() + 1);
                }
            },
            // Merge two partial accumulators for the same user.
            new Function2<Tuple4<Integer, Integer, Integer, Integer>, Tuple4<Integer, Integer, Integer, Integer>, Tuple4<Integer, Integer, Integer, Integer>>() {
                public Tuple4<Integer, Integer, Integer, Integer> call(Tuple4<Integer, Integer, Integer, Integer> a, Tuple4<Integer, Integer, Integer, Integer> b) throws Exception {
                    return new Tuple4<Integer, Integer, Integer, Integer>(Math.max(a._1(), b._1()), Math.min(a._2(), b._2()), a._3() + b._3(), a._4() + b._4());
                }
            }
        );

        // Turn (max, min, sum, count) into (max, min, mean).
        JavaRDD<Tuple2<String, Tuple3<Integer, Integer, Double>>> aggregateWithMean = aggregate.map(
            new Function<Tuple2<String, Tuple4<Integer, Integer, Integer, Integer>>, Tuple2<String, Tuple3<Integer, Integer, Double>>>() {
                public Tuple2<String, Tuple3<Integer, Integer, Double>> call(Tuple2<String, Tuple4<Integer, Integer, Integer, Integer>> a) throws Exception {
                    Tuple3<Integer, Integer, Double> stats = new Tuple3<Integer, Integer, Double>(a._2()._1(), a._2()._2(), a._2()._3().doubleValue() / a._2()._4());
                    return new Tuple2<String, Tuple3<Integer, Integer, Double>>(a._1(), stats);
                }
            }
        );

        // Print one (userId, (max, min, mean)) tuple per user.
        aggregateWithMean.foreach(new VoidFunction<Tuple2<String, Tuple3<Integer, Integer, Double>>>() {
            public void call(Tuple2<String, Tuple3<Integer, Integer, Double>> entry) throws Exception {
                System.out.println(entry);
            }
        });

        sc.stop();
    }
}
With the Java API this gets very messy because of the generic types. I suggest using the Scala API instead; it looks much nicer :-)