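I want to compare two DataFrames that share the same schema and have a primary key column.
For each primary key, if any of the other columns differ (there may be several, so the remaining columns need to be scanned dynamically), I want to output the column name and the value from each DataFrame.
Also, if a primary key is missing from one of the DataFrames, I want that reported as well (so a full outer join is needed). Here is an example: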
dataframe1:
+-----------+------+------+
|primary_key|book  |number|
+-----------+------+------+
|1          |book1 |1     |
|2          |book2 |2     |
|3          |book3 |3     |
|4          |book4 |4     |
+-----------+------+------+
dataframe2:
+-----------+------+------+
|primary_key|book  |number|
+-----------+------+------+
|1          |book1 |1     |
|2          |book8 |8     |
|3          |book3 |7     |
|5          |book5 |5     |
+-----------+------+------+
The expected result is:
+-----------+----------------+----------+----------+
|primary_key|diff_column_name|dataframe1|dataframe2|
+-----------+----------------+----------+----------+
|2          |book            |book2     |book8     |
|2          |number          |2         |8         |
|3          |number          |3         |7         |
|4          |book            |book4     |null      |
|4          |number          |4         |null      |
|5          |book            |null      |book5     |
|5          |number          |null      |5         |
+-----------+----------------+----------+----------+
I know the first step is to join the two DataFrames on the primary key:
// joining the two DFs on primary_key
val result = df1.as("l")
.join(df2.as("r"), "primary_key", "fullouter")
But I don't know what to do next. Can anyone give me some advice? Thanks.
Answer 0 (score: 2)
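Imports:
import org.apache.spark.sql.functions._
// Needed for the $"..." column syntax used below; assumes a SparkSession named `spark`
import spark.implicits._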
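The sample data from the question can be rebuilt along the following lines. This is only a sketch: the local `SparkSession` setup and the `Int` type for `number` are assumptions, since the question shows the tables but not how they were created.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("df-diff").getOrCreate()
import spark.implicits._

// Rows copied from the dataframe1 table in the question
val df1 = Seq(
  (1, "book1", 1), (2, "book2", 2), (3, "book3", 3), (4, "book4", 4)
).toDF("primary_key", "book", "number")

// Rows copied from the dataframe2 table in the question
val df2 = Seq(
  (1, "book1", 1), (2, "book8", 8), (3, "book3", 7), (5, "book5", 5)
).toDF("primary_key", "book", "number")
```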
Define the list of columns to compare:
val cols = Seq("book", "number")
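Since the question asks for a dynamic way to scan all non-key columns, the hardcoded list could instead be derived from the schema. A minimal sketch, assuming both DataFrames really do share the same column names and that the key column is called `primary_key`; the tiny `df1` here is a stand-in for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("df-diff-cols").getOrCreate()
import spark.implicits._

// Stand-in for the question's first DataFrame
val df1 = Seq((1, "book1", 1)).toDF("primary_key", "book", "number")

// Every column except the key, in schema order
val cols: Seq[String] = df1.columns.filterNot(_ == "primary_key").toSeq
```

With this in place, adding a column to both DataFrames would not require touching the comparison code.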
Join the two DataFrames:
val joined = df1.as("l").join(df2.as("r"), Seq("primary_key"), "fullouter")
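Define the comparison expression:
// One struct per compared column, exploded into one row per column
val comp = explode(array(cols.map(c => struct(
  lit(c).alias("diff_column_name"),
  // Value from the left DataFrame (cast to string so all structs share one type)
  col(s"l.${c}").cast("string").alias("dataframe1"),
  // Value from the right DataFrame
  col(s"r.${c}").cast("string").alias("dataframe2"),
  // True when the two values differ; <=> is null-safe equality
  not(col(s"l.${c}") <=> col(s"r.${c}")).alias("diff")
)): _*))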
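The comparison uses `<=>` rather than `===` because, after a full outer join, one side may be null. With `===` a comparison against null evaluates to null, so `not(...)` would also be null and the row would be filtered out; `<=>` treats null as an ordinary comparable value. A small sketch to see the difference, assuming a local `SparkSession`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("null-safe-eq").getOrCreate()

spark.range(1).select(
  // Ordinary equality: null compared with 1 yields null
  (lit(null).cast("int") === lit(1)).alias("eq"),
  // Null-safe equality: null compared with 1 yields false
  (lit(null).cast("int") <=> lit(1)).alias("null_safe_eq")
).show()
```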
Select and filter:
joined
  .withColumn("comp", comp)
  .select($"primary_key", $"comp.*")
  // Keep only the rows that differ, then drop the helper diff column
  .where($"diff").drop("diff")
  .orderBy("primary_key").show
// +-----------+----------------+----------+----------+
// |primary_key|diff_column_name|dataframe1|dataframe2|
// +-----------+----------------+----------+----------+
// |          2|            book|     book2|     book8|
// |          2|          number|         2|         8|
// |          3|          number|         3|         7|
// |          4|            book|     book4|      null|
// |          4|          number|         4|      null|
// |          5|            book|      null|     book5|
// |          5|          number|      null|         5|
// +-----------+----------------+----------+----------+