Is there a way to join two Spark DataFrames via two lists, when the column names differ?
I know that if the column names in the lists are the same, I can do the following:
val joindf = df1.join(df2, Seq("col_a", "col_b"), "left")
Or, if I know the differing column names, I can do this:
df1.join(
  df2,
  df1("col_a") <=> df2("col_x")
    && df1("col_b") <=> df2("col_y"),
  "left"
)
Since my method takes two lists as input, specifying which columns of each DataFrame to join on, I'd like to know whether Scala Spark has a way to do this.
P.S. I'm looking for something like the Python pandas merge:
joindf = pd.merge(df1, df2, left_on = list1, right_on = list2, how = 'left')
Answer 0: (score: 1)
If you need two lists of strings:
Just zip and reduce:
val leftOn = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")
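The answer stops before showing the zip-and-reduce step itself; a minimal sketch of what it would look like, assuming the `df1`/`df2` from the question, is:

```scala
// Pair each left-hand key column with its right-hand counterpart.
val leftOn  = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")
val pairs   = leftOn.zip(rightOn)  // Seq(("col_a","col_x"), ("col_b","col_y"))

// With df1 and df2 in scope, each pair becomes one equality condition,
// and the conditions are reduced into a single join expression:
//   val joinExpr = pairs.map { case (l, r) => df1(l) === df2(r) }.reduce(_ && _)
//   val joindf   = df1.join(df2, joinExpr, "left")
```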
Answer 1: (score: 0)
You can easily define such a method yourself:
def merge(left: DataFrame, right: DataFrame, left_on: Seq[String], right_on: Seq[String], how: String) = {
  import org.apache.spark.sql.functions.lit
  // Start from a literal `true` and AND in one equality per column pair.
  val joinExpr = left_on.zip(right_on).foldLeft(lit(true)) {
    case (acc, (lkey, rkey)) => acc and (left(lkey) === right(rkey))
  }
  left.join(right, joinExpr, how)
}
val df1 = Seq((1, "a")).toDF("id1", "n1")
val df2 = Seq((1, "a")).toDF("id2", "n2")
val joindf = merge(df1, df2, left_on = Seq("id1", "n1"), right_on = Seq("id2", "n2"), how = "left")