Spark join of 2 DataFrames with lists of column names

Posted: 2018-05-07 18:45:52

Tags: scala apache-spark join

Given two lists of column names, is there a way to join two Spark DataFrames whose join columns are named differently?

I know that if the names in the list are the same in both DataFrames, I can do the following:

val joindf = df1.join(df2, Seq("col_a", "col_b"), "left")

Or, if I know the differing column names, I can do this:

df1.join(
  df2,
  df1("col_a") <=> df2("col_x")
    && df1("col_b") <=> df2("col_y"),
  "left"
)

Since my method takes two lists as input, specifying which columns of each DataFrame to use for the join, I am wondering whether there is a way to do this in Scala Spark.

P.S. I am looking for something like the pandas merge in Python:
joindf = pd.merge(df1, df2, left_on = list1, right_on = list2, how = 'left')

2 Answers:

Answer 0 (score: 1)

If you need two lists of strings:


val leftOn = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")

just zip and reduce.
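
A minimal sketch of the zip-and-reduce step, assuming the df1 and df2 from the question (with columns col_a/col_b and col_x/col_y) are in scope:

import org.apache.spark.sql.Column

// Pair up the left and right column names, build one null-safe equality per pair,
// and fold them into a single join condition with `and`.
val joinExpr: Column = leftOn.zip(rightOn)
  .map { case (l, r) => df1(l) <=> df2(r) }
  .reduce(_ and _)

val joindf = df1.join(df2, joinExpr, "left")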

Answer 1 (score: 0)

You can easily define such a method yourself:

import org.apache.spark.sql.DataFrame

def merge(left: DataFrame, right: DataFrame, left_on: Seq[String], right_on: Seq[String], how: String): DataFrame = {
  import org.apache.spark.sql.functions.lit
  // Start from a trivially true condition and AND in one equality per zipped column pair.
  val joinExpr = left_on.zip(right_on).foldLeft(lit(true)) { case (acc, (lkey, rkey)) => acc and (left(lkey) === right(rkey)) }
  left.join(right, joinExpr, how)
}


// In spark-shell the toDF implicits are already in scope; otherwise:
import spark.implicits._

val df1 = Seq((1, "a")).toDF("id1", "n1")
val df2 = Seq((1, "a")).toDF("id2", "n2")

val joindf = merge(df1, df2, left_on = Seq("id1", "n1"), right_on = Seq("id2", "n2"), how = "left")
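
For reference, a sketch of what the result should look like with the sample rows above (assuming a spark-shell session; the single row matches on both columns):

joindf.show()
// +---+---+---+---+
// |id1| n1|id2| n2|
// +---+---+---+---+
// |  1|  a|  1|  a|
// +---+---+---+---+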