Add a column from the right dataframe

Time: 2019-05-13 12:37:02

Tags: scala apache-spark

I have these 3 dataframes:

df1
+---+----+--------+
| id|file|  status|
+---+----+--------+
|  1| df2|employee|
|  2| df3|employee|
|  3| df2| trainee|
|  4| df3| trainee|
|  5| df3| trainee|
+---+----+--------+

df2
+---+------+----------+
| id|salary|entry_date|
+---+------+----------+
|  1|  4000|06-01-2017|
|  2|  7000|05-03-2015|
|  3|  1500|01-05-2019|
|  4|  1500|01-05-2019|
+---+------+----------+

df3
+---+------+----------+
| id|salary|entry_date|
+---+------+----------+
|  1|  4500|09-01-2016|
|  2|  7000|01-01-2016|
|  3|  1500|05-09-2019|
|  4|  1500|05-04-2019|
|  5|  1300|10-04-2019|
+---+------+----------+
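
For reference, a minimal sketch that rebuilds the three sample dataframes above (it assumes an active SparkSession named spark, and keeps entry_date as a plain string):

import spark.implicits._

// Sample rows copied from the tables above.
val df1 = Seq(
  (1, "df2", "employee"), (2, "df3", "employee"),
  (3, "df2", "trainee"), (4, "df3", "trainee"), (5, "df3", "trainee")
).toDF("id", "file", "status")

val df2 = Seq(
  (1, 4000, "06-01-2017"), (2, 7000, "05-03-2015"),
  (3, 1500, "01-05-2019"), (4, 1500, "01-05-2019")
).toDF("id", "salary", "entry_date")

val df3 = Seq(
  (1, 4500, "09-01-2016"), (2, 7000, "01-01-2016"),
  (3, 1500, "05-09-2019"), (4, 1500, "05-04-2019"), (5, 1300, "10-04-2019")
).toDF("id", "salary", "entry_date")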

I want to join these dataframes and keep only the correct columns: the file column in df1 tells us which dataframe's columns have to be kept for each row.

The result would be:

+---+----+--------+------+----------+
| id|file|  status|salary|entry_date|
+---+----+--------+------+----------+
|  1| df2|employee|  4000|06-01-2017|
|  2| df3|employee|  7000|01-01-2016|
|  3| df2| trainee|  1500|01-05-2019|
|  4| df3| trainee|  1500|05-04-2019|
|  5| df3| trainee|  1300|10-04-2019|
+---+----+--------+------+----------+

I thought about using something like withColumn("salary", when('file === "df2", df2("salary")).otherwise(df3("salary"))), but with several dataframes this would turn into a mess.
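
Spelled out for these three dataframes, that idea might look like the sketch below (the left joins on id and the per-column when/otherwise chains are my assumption of how the snippet would be completed):

import org.apache.spark.sql.functions.when

// Join df1 against both payload dataframes, then pick each value
// from the dataframe named in df1's file column.
val result = df1
  .join(df2, df1("id") === df2("id"), "left")
  .join(df3, df1("id") === df3("id"), "left")
  .select(
    df1("id"), df1("file"), df1("status"),
    when(df1("file") === "df2", df2("salary"))
      .otherwise(df3("salary")).as("salary"),
    when(df1("file") === "df2", df2("entry_date"))
      .otherwise(df3("entry_date")).as("entry_date"))

Every additional dataframe adds one more join plus one more branch in every when chain, which is why this does not scale well.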

Do you know a more elegant way to get the same result?
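
For comparison, one possible shape for such a solution (a sketch, not from the original post; it assumes all payload dataframes share the same schema and that unionByName, available since Spark 2.3, can be used): tag each dataframe with a literal file value matching df1's file column, union them, and join once on both id and file.

import org.apache.spark.sql.functions.lit

// Stack the payload dataframes, each tagged with the name that
// df1's file column uses to refer to it.
val tagged = df2.withColumn("file", lit("df2"))
  .unionByName(df3.withColumn("file", lit("df3")))

// A single join on (id, file) keeps exactly the matching rows.
val joined = df1.join(tagged, Seq("id", "file"), "left")

Adding a fourth dataframe then costs one more withColumn/unionByName line instead of another join and extra when branches.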

Thanks

0 Answers:

No answers yet.