How do I join two DataFrames and update missing values when there are multiple primary keys?

Time: 2017-06-21 03:36:39

Tags: scala apache-spark dataframe

Case 1: merge

Old DataFrame:

+---+---+----+----+
|pk1|pk2|val1|val2|
+---+---+----+----+
|  1| aa|  ab|  ac|
|  2| bb|  bc|  bd|
+---+---+----+----+

New DataFrame:

+---+---+----+----+
|pk1|pk2|val1|val2|
+---+---+----+----+
|  1| aa|  ab|  ad|
|  2| bb|  bb|  bd|
|  3| cc|  cc|  cc|
+---+---+----+----+

Result:

+---+---+----+----+
|pk1|pk2|val1|val2|
+---+---+----+----+
|  1| aa|  ab|  ad|
|  2| bb|  bb|  bd|
|  3| cc|  cc|  cc|
+---+---+----+----+

Would an outer join on multiple keys work?

1 answer:

Answer 0 (score: 1)

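Based on your sample data, I assume that wherever a value in the new DataFrame differs from the one in the old DataFrame, the new value is the one to keep.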

[UPDATE] If the val columns are dynamic, you can apply foldLeft over the list of columns, as follows:

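// Assumes a spark-shell session, where `spark` is already defined
import org.apache.spark.sql.functions.when
import spark.implicits._

val dfOld = Seq(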
  (1, "aa", "ab", "ac"),
  (2, "bb", "bc", "bd")
).toDF("pk1", "pk2", "val1", "val2")

val dfNew = Seq(
  (1, "aa", "ab", "ad"),
  (2, "bb", "bb", "bd"),
  (3, "cc", "cc", "cc")
).toDF("pk1", "pk2", "val1", "val2")

// Assemble the list of selected val-columns
val valColumns = dfNew.columns.filter(x => x != "pk1" && x != "pk2")

val dfJoined = dfNew.join(dfOld, Seq("pk1", "pk2"), "left_outer")

// Generate diff-columns from the val-column list
val dfDiff = valColumns.foldLeft(dfJoined)( (acc, x ) =>
  acc.withColumn(
    x + "diff",
    when( !(dfNew(x) === dfOld(x)) || (dfOld(x).isNull), dfNew(x) ).otherwise( null )
  ).
  drop(x)
)

dfDiff.show
+---+---+--------+--------+
|pk1|pk2|val1diff|val2diff|
+---+---+--------+--------+
|  1| aa|    null|      ad|
|  2| bb|      bb|    null|
|  3| cc|      cc|      cc|
+---+---+--------+--------+
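
The diff columns above show which values changed. To produce the merged result the question asks for, one option is a full outer join on both keys, with coalesce preferring the new value and falling back to the old one. This is a minimal sketch rather than part of the original answer: it assumes a local SparkSession (spark-shell already provides spark), and the aliases n and o and the names keys, joined, and merged are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

// Assumption: a local session for illustration; in spark-shell, reuse the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("merge-sketch").getOrCreate()
import spark.implicits._

val dfOld = Seq(
  (1, "aa", "ab", "ac"),
  (2, "bb", "bc", "bd")
).toDF("pk1", "pk2", "val1", "val2")

val dfNew = Seq(
  (1, "aa", "ab", "ad"),
  (2, "bb", "bb", "bd"),
  (3, "cc", "cc", "cc")
).toDF("pk1", "pk2", "val1", "val2")

// Hypothetical helper names: the composite key and the remaining value columns
val keys = Seq("pk1", "pk2")
val valColumns = dfNew.columns.filter(c => !keys.contains(c))

// Alias each side so the same-named value columns stay distinguishable after the join.
// Joining on a Seq of column names emits each key column only once.
val joined = dfNew.as("n").join(dfOld.as("o"), keys, "full_outer")

// For every value column, take the new value when present, otherwise the old one
val merged = joined.select(
  keys.map(k => col(k)) ++
    valColumns.map(c => coalesce(col(s"n.$c"), col(s"o.$c")).as(c)): _*
)

merged.orderBy("pk1", "pk2").show()
```

With the sample data this should yield the three rows of the Result table in the question. Note that the coalesce fallback also fills a null in the new DataFrame from the old one, and keeps rows that exist only in the old DataFrame; if nulls in the new data are meant to overwrite old values, a different rule would be needed.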