I have two datasets:
Dataset 1:
CustID CustName CustRegion
1 Joe Canada
2 Jane UK
Dataset 2:
CustID CustName CustCity CustPrice
1 Joe Berlin 20
2 Jane UK 11
3 Bill France 30
After merging the two dataframes, I want to get the following output:
CustID CustName CustRegion CustCity CustPrice
1 Joe Canada (null, Berlin) (null, 20)
2 Jane UK (null, UK) (null, 11)
3 (null, Bill) (null, null) (null, France) (null, 30)
I tried the following code, but I get an error when using ...:
val cols = df1.columns.filter(_ != "CustID").toList
// function to create an expression that results in null for similar values,
// and with a two-item array with the differing values otherwise
def mapDiffs(name: String) = {
  when($"df1.$name" === $"df2.$name", df2.$name)
    .otherwise(array($"l.$name", $"r.$name"))
    .as(name)
}
// joining the two DFs on OrgId
val result = df1.as("l")
  .join(targetDF.as("r"), "CustID")
  .select($"CustID" :: cols.map(mapDiffs): _*)
result.show()
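Answer 0 (score: 0)
I found two mistakes in your attempt. In particular, you alias the dataframes as l and r, but inside the function you refer to them as df1 and df2.
Here is a possible solution. I have created the temporary dataframes mentioned in the question and added comments to each part for explanation.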
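//imports needed by this snippet; spark.implicits._ assumes a SparkSession named spark (as in spark-shell)
import org.apache.spark.sql.functions._
import spark.implicits._
//creating temporary dataframes for testing
val df1 = Seq(
  ("1", "Joe", "Canada"),
  ("2", "Jane", "UK")
).toDF("CustID", "CustName", "CustRegion")
val targetDF = Seq(
  ("1", "Joe", "Berlin", "20"),
  ("2", "Jane", "UK", "11"),
  ("3", "Bill", "France", "30")
).toDF("CustID", "CustName", "CustCity", "CustPrice")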
//selecting column names of both dataframe without CustID
val cols1 = df1.columns.filter(_ != "CustID").toList
val cols2 = targetDF.columns.filter(_ != "CustID").toList
//all column names of both dataframes without duplicates, except CustID
val allCols = (cols1 ++ cols2).toSet
//extra columns needed for both dataframes columns to be same
val extraColsDf1 = allCols -- cols1
val extraColsTargetDf = allCols -- cols2
//adding the extra columns and populating with null
val tempDf1 = extraColsDf1.foldLeft(df1){(temp, c) => temp.withColumn(c, lit(null))}
val tempTargetDf = extraColsTargetDf.foldLeft(targetDF){(temp, c) => temp.withColumn(c, lit(null))}
//function for getting your desired output
def mapDiffs(name: String) = {
  when($"l.$name".isNull && $"r.$name".isNull, concat_ws(",", lit("null"), lit("null"))) // for outputs like null,null
    .when($"r.$name".isNull, $"l.$name") // for outputs like Canada and UK
    .when($"l.$name".isNull, concat_ws(",", lit("null"), $"r.$name")) // for outputs like null,Bill
    .when($"l.$name" === $"r.$name", $"l.$name") // for outputs like Joe (both sides are non-null here)
    .otherwise(concat_ws(",", $"l.$name", $"r.$name")) // for the rest of the outputs
    .as(name)
}
// outer joining the two DFs on CustID and calling the above function
val result = tempDf1.as("l")
  .join(tempTargetDf.as("r"), Seq("CustID"), "outer")
  .select(Seq($"CustID") ++ allCols.map(mapDiffs): _*)
//showing the output
result.show(false)
This should give you:
+------+---------+----------+-----------+---------+
|CustID|CustName |CustRegion|CustCity |CustPrice|
+------+---------+----------+-----------+---------+
|3 |null,Bill|null,null |null,France|null,30 |
|1 |Joe |Canada |null,Berlin|null,20 |
|2 |Jane |UK |null,UK |null,11 |
+------+---------+----------+-----------+---------+
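If you want to check the per-column rule without starting a Spark session, the logic above can be modelled in plain Scala. This is only a rough sketch, not Spark code: each cell is an Option[String] with None standing in for a SQL null, and the names MergeSketch, Row, mergeCell and outerMerge are made up for illustration. It mirrors the three steps of the answer: treat columns missing on one side as null, outer join on CustID, then apply the same case order as mapDiffs.

```scala
// Plain-Scala model of the answer's logic; no Spark required.
// All names here (MergeSketch, Row, mergeCell, outerMerge) are hypothetical.
object MergeSketch {
  // A row maps column name -> cell; None stands in for a SQL null.
  type Row = Map[String, Option[String]]
  val emptyRow: Row = Map.empty

  // Same case order as mapDiffs, applied to one pair of cells.
  def mergeCell(l: Option[String], r: Option[String]): String = (l, r) match {
    case (None, None)                 => "null,null" // missing on both sides
    case (Some(a), None)              => a           // only the left side has a value
    case (None, Some(b))              => s"null,$b"  // only the right side has a value
    case (Some(a), Some(b)) if a == b => a           // both sides agree
    case (Some(a), Some(b))           => s"$a,$b"    // both sides differ
  }

  // Full outer join on the key, then merge every non-key column.
  def outerMerge(left: Map[String, Row],
                 right: Map[String, Row],
                 cols: Seq[String]): Map[String, Seq[String]] =
    (left.keySet ++ right.keySet).map { id =>
      val l = left.getOrElse(id, emptyRow)
      val r = right.getOrElse(id, emptyRow)
      // A column absent from a row counts as null, like the lit(null) padding.
      id -> cols.map(c => mergeCell(l.getOrElse(c, None), r.getOrElse(c, None)))
    }.toMap

  val cols = Seq("CustName", "CustRegion", "CustCity", "CustPrice")

  // The two datasets from the question, keyed by CustID.
  val left: Map[String, Row] = Map(
    "1" -> Map("CustName" -> Some("Joe"), "CustRegion" -> Some("Canada")),
    "2" -> Map("CustName" -> Some("Jane"), "CustRegion" -> Some("UK"))
  )
  val right: Map[String, Row] = Map(
    "1" -> Map("CustName" -> Some("Joe"), "CustCity" -> Some("Berlin"), "CustPrice" -> Some("20")),
    "2" -> Map("CustName" -> Some("Jane"), "CustCity" -> Some("UK"), "CustPrice" -> Some("11")),
    "3" -> Map("CustName" -> Some("Bill"), "CustCity" -> Some("France"), "CustPrice" -> Some("30"))
  )

  val merged: Map[String, Seq[String]] = outerMerge(left, right, cols)
}
```

MergeSketch.merged("3") holds the four merged cells for CustID 3 in the order given by cols, which you can compare against the table above. Using a Seq for cols keeps the column order explicit instead of relying on the iteration order of a Set.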
You can modify it to suit your needs.
I hope this answer helps.