Spark: Merging two RDDs of Java objects into one

Time: 2017-11-09 15:24:47

Tags: java apache-spark rdd

I have two JavaRDDs of the same object type, and I want to merge their data into a single RDD. They are:

public class User {
    String name;
    String email;
    String profession;
    Integer age;

    // constructor

    // setters and getters
}

RDD 1

User user1 = new User ("Name", "email@email.com");
User user2 = new User ("Name2", "email2@email.com");

List<User> userList = new ArrayList<>();
userList.add(user1);
userList.add(user2);

JavaRDD<User> leftUserJavaRDD = sc.parallelize(userList);

RDD 2

User user3 = new User ("email@email.com", "Software Engineer", 26);
User user4 = new User ("email2@email.com", "Lawyer", 35);

List<User> userList2 = new ArrayList<>();
userList2.add(user3);
userList2.add(user4);

JavaRDD<User> rightUserJavaRDD = sc.parallelize(userList2);

I want to join the two RDDs on the email address they have in common. The combined RDD I want is:

User user1and3 = new User (
        "Name",
        "email@email.com",
        "Software Engineer",
        26);

User user2and4 = new User (
        "Name2",
        "email2@email.com",
        "Lawyer",
        35);

How can I do this in Spark using Java? I tried union and cartesian, but neither gave the result I wanted.

1 Answer:

Answer 0: (score: 1)

I got help from a colleague, and this is the solution we came up with.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import scala.Tuple2;

import java.util.List;

public JavaRDD<User> getCombinedUsers(JavaRDD<User> leftUserJavaRDD, JavaRDD<User> rightUserJavaRDD) {

     JavaPairRDD<String, User> leftUserJavaPairRDD =
                leftUserJavaRDD.mapToPair(user -> new Tuple2<>(user.getEmail(), user));

     JavaPairRDD<String, User> rightUserJavaPairRDD =
                rightUserJavaRDD.mapToPair(user -> new Tuple2<>(user.getEmail(), user));

     return leftUserJavaPairRDD
                .union(rightUserJavaPairRDD)
                .reduceByKey(merge).values();
}

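/**
 * Reduce function that merges two partial User records sharing the same email.
 * reduceByKey expects a commutative function and does not guarantee which record
 * arrives as "left", so each field is taken from whichever side has it set.
 */
private static Function2<User, User, User> merge =
            (User left, User right) ->
                    new User(
                            left.getName() != null ? left.getName() : right.getName(),
                            left.getEmail(),
                            left.getProfession() != null ? left.getProfession() : right.getProfession(),
                            left.getAge() != null ? left.getAge() : right.getAge());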
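
The union-then-reduceByKey idea can be checked locally without a cluster. Below is a minimal plain-Java sketch, not Spark code: it concatenates two lists and folds them by email with Map.merge, which plays the role of union followed by reduceByKey. The class UserMergeSketch, its nested User stand-in, and the helpers firstNonNull and combineByEmail are hypothetical names introduced only for this illustration. It assumes each field is set on at most one of the two records being merged; under that assumption the merge does not depend on argument order, which matters because Spark expects the function passed to reduceByKey to be associative and commutative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local check of the merge logic; plain Java, no Spark involved.
final class UserMergeSketch {

    // Minimal stand-in for the question's User class; unset fields are null.
    static final class User {
        final String name;
        final String email;
        final String profession;
        final Integer age;

        User(String name, String email, String profession, Integer age) {
            this.name = name;
            this.email = email;
            this.profession = profession;
            this.age = age;
        }
    }

    // Returns a if it is non-null, otherwise b.
    static <T> T firstNonNull(T a, T b) {
        return a != null ? a : b;
    }

    // Merge that takes each field from whichever side has it set.
    // Assumption: a field is never set to different values on both sides.
    static User merge(User a, User b) {
        return new User(
                firstNonNull(a.name, b.name),
                firstNonNull(a.email, b.email),
                firstNonNull(a.profession, b.profession),
                firstNonNull(a.age, b.age));
    }

    // Plain-Java analogue of union + reduceByKey: concatenate, then fold by email.
    static List<User> combineByEmail(List<User> left, List<User> right) {
        List<User> all = new ArrayList<>(left);
        all.addAll(right);
        Map<String, User> byEmail = new LinkedHashMap<>();
        for (User u : all) {
            byEmail.merge(u.email, u, UserMergeSketch::merge);
        }
        return new ArrayList<>(byEmail.values());
    }

    public static void main(String[] args) {
        List<User> left = Arrays.asList(
                new User("Name", "email@email.com", null, null),
                new User("Name2", "email2@email.com", null, null));
        List<User> right = Arrays.asList(
                new User(null, "email@email.com", "Software Engineer", 26),
                new User(null, "email2@email.com", "Lawyer", 35));

        // Lists are passed in reverse order on purpose to exercise order independence.
        for (User u : combineByEmail(right, left)) {
            System.out.println(u.name + ", " + u.email + ", " + u.profession + ", " + u.age);
        }
    }
}
```

Running main prints one fully populated user per email even though the two lists are passed in reverse order. Note that, as with union plus reduceByKey, a user present in only one list passes through with the missing fields left as null.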