I have two JavaRDDs of the same object type, and I want to merge their data into one. They are:
Domain
public class User {
    String name;
    String email;
    String profession;
    Integer age;
    // constructor
    // setters and getters
}
RDD 1
User user1 = new User ("Name", "email@email.com");
User user2 = new User ("Name2", "email2@email.com");
List<User> userList = new ArrayList<>();
userList.add(user1);
userList.add(user2);
JavaRDD<User> leftUserJavaRDD = sc.parallelize(userList);
RDD 2
User user3 = new User ("email@email.com", "Software Engineer", 26);
User user4 = new User ("email2@email.com", "Lawyer", 35);
List<User> userList2 = new ArrayList<>();
userList2.add(user3);
userList2.add(user4);
JavaRDD<User> rightUserJavaRDD = sc.parallelize(userList2);
I want to join the two RDDs on the shared email address. The combined RDD I want is:
User user1and3 = new User (
"Name",
"email@email.com",
"Software Engineer",
26);
User user2and4 = new User (
"Name2",
"email2@email.com",
"Lawyer",
35);
How can I do this in Spark using Java?
I tried union and cartesian, but neither worked.
Answer 0 (score: 1)
I got help from a colleague, and this is the solution we came up with.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import scala.Tuple2;
import java.util.List;
public JavaRDD<User> getCombinedUsers(JavaRDD<User> leftUserJavaRDD, JavaRDD<User> rightUserJavaRDD) {
    JavaPairRDD<String, User> leftUserJavaPairRDD =
        leftUserJavaRDD.mapToPair(user -> new Tuple2<>(user.getEmail(), user));
    JavaPairRDD<String, User> rightUserJavaPairRDD =
        rightUserJavaRDD.mapToPair(user -> new Tuple2<>(user.getEmail(), user));
    return leftUserJavaPairRDD
        .union(rightUserJavaPairRDD)
        .reduceByKey(merge).values();
}
/**
 * Reduce function that merges a User lacking profession and age with the User that has them.
 * Assumes the name-only User arrives as `left`; reduceByKey expects a commutative function, so that order is not guaranteed.
 */
private static Function2<User, User, User> merge =
    (User left, User right) ->
        new User(left.getName(), left.getEmail(), right.getProfession(), right.getAge());
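The merge above takes the name from `left` and the profession and age from `right`, so it depends on which record Spark hands over first. A possible order-independent variant is sketched below in plain Java, with no Spark dependency so it can be run locally. The `UserMergeSketch` class, its nested `User` stand-in, and the `firstNonNull` and `combineByEmail` helpers are all hypothetical names made up for this illustration. `combineByEmail` only imitates what union followed by reduceByKey does, using a local map.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UserMergeSketch {

    // Hypothetical stand-in for the User class from the question.
    // Serializable because Spark needs to ship records between executors.
    static class User implements Serializable {
        final String name;
        final String email;
        final String profession;
        final Integer age;

        User(String name, String email, String profession, Integer age) {
            this.name = name;
            this.email = email;
            this.profession = profession;
            this.age = age;
        }
    }

    // Returns a when it is set, otherwise b.
    static <T> T firstNonNull(T a, T b) {
        return a != null ? a : b;
    }

    // Picks each field from whichever side has it, so merge(a, b) and
    // merge(b, a) give the same result when the two records do not conflict.
    static User merge(User a, User b) {
        return new User(
            firstNonNull(a.name, b.name),
            firstNonNull(a.email, b.email),
            firstNonNull(a.profession, b.profession),
            firstNonNull(a.age, b.age));
    }

    // Local imitation of union + reduceByKey: records sharing an email
    // are folded together with merge.
    static Map<String, User> combineByEmail(List<User> left, List<User> right) {
        Map<String, User> byEmail = new LinkedHashMap<>();
        for (User u : left) {
            byEmail.merge(u.email, u, UserMergeSketch::merge);
        }
        for (User u : right) {
            byEmail.merge(u.email, u, UserMergeSketch::merge);
        }
        return byEmail;
    }
}
```

In the Spark version, the same idea would amount to passing a null-tolerant function like this to `reduceByKey` in place of the order-dependent one. If both records could carry different non-null values for the same field, the result would again depend on order, so this sketch assumes the two sides never conflict.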