Question

I have two Hive tables, A and B, and their respective DataFrames df_a and df_b.

A
+----+------+-----------+
| id | name | mobile1   |
+----+------+-----------+
| 1  | Matt | 123456798 |
+----+------+-----------+
| 2  | John | 123456798 |
+----+------+-----------+
| 3  | Lena |           |
+----+------+-----------+

B
+----+------+-----------+
| id | name | mobile2   |
+----+------+-----------+
| 3  | Lena | 123456798 |
+----+------+-----------+

I want to perform an operation similar to this Hive query:

select A.name, nvl(nvl(A.mobile1, B.mobile2), 0) from A left outer join B on A.id = B.id

So far I have come up with:

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(?)

I can't figure out how to conditionally select mobile1, mobile2, or 0 the way the Hive query does.

Can someone help me? I am using Spark 1.5.

Answer 1

Use coalesce:

import org.apache.spark.sql.functions._
df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
     coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

This returns mobile1 if it is not null, otherwise mobile2, and 0 if mobile2 is null as well.
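For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the two tables from the question in memory and also keeps the name column. It assumes a local Spark setup with SQLContext (as in Spark 1.5), stores the mobile numbers as Long, and adds one hypothetical row ("Anna") that is not in the question so the 0 fallback is visible. The object name, app name, and the "mobile" alias are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{coalesce, lit}

object CoalesceExample {
  // Returns name -> chosen mobile number for every row of A.
  def chooseMobile(sqlContext: SQLContext): Map[String, Long] = {
    import sqlContext.implicits._

    // Rows from the question; None stands for the empty mobile1 cell.
    val df_a = Seq[(Int, String, Option[Long])](
      (1, "Matt", Some(123456798L)),
      (2, "John", Some(123456798L)),
      (3, "Lena", None),
      (4, "Anna", None) // hypothetical row, not in the question
    ).toDF("id", "name", "mobile1")

    val df_b = Seq((3, "Lena", 123456798L)).toDF("id", "name", "mobile2")

    val result = df_a
      .join(df_b, df_a("id") <=> df_b("id"), "left_outer")
      .select(
        df_a("name"),
        // First non-null of mobile1, mobile2, then the literal 0.
        coalesce(df_a("mobile1"), df_b("mobile2"), lit(0L)).as("mobile")
      )

    result.collect().map(row => row.getString(0) -> row.getLong(1)).toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master; adjust for a real cluster.
    val conf = new SparkConf().setAppName("coalesce-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try chooseMobile(new SQLContext(sc)).foreach(println)
    finally sc.stop()
  }
}
```

With this data, Lena picks up her number from B, and Anna, who has no match in B and no mobile1, ends up with 0.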

Answer 2

You can use Spark SQL's nanvl function. Applied here, it should look something like this:

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer")
.select(df_a("name"), nanvl(nanvl(df_a("mobile1"), df_b("mobile2")), lit(0)))
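One thing to check before relying on this: nanvl is meant for floating-point columns and replaces NaN values, while a null input stays null, so it is not a drop-in equivalent of Hive's nvl. The sketch below compares nanvl and coalesce on a small hypothetical Double column (the data, column names, and object name are made up for illustration), again assuming a local SQLContext setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions.{coalesce, lit, nanvl}

object NanvlVsCoalesce {
  // Returns rows of (kind, with_nanvl, with_coalesce).
  def compare(sqlContext: SQLContext): Array[Row] = {
    import sqlContext.implicits._

    // Hypothetical data: one ordinary value, one NaN, one null.
    val df = Seq[(String, Option[Double])](
      ("value", Some(1.5)),
      ("nan", Some(Double.NaN)),
      ("null", None)
    ).toDF("kind", "x")

    df.select(
      df("kind"),
      nanvl(df("x"), lit(0.0)).as("with_nanvl"),       // replaces NaN; null stays null
      coalesce(df("x"), lit(0.0)).as("with_coalesce")  // replaces null; NaN stays NaN
    ).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master; adjust for a real cluster.
    val conf = new SparkConf().setAppName("nanvl-vs-coalesce").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try compare(new SQLContext(sc)).foreach(println)
    finally sc.stop()
  }
}
```

If the missing mobile numbers are nulls, as they would be after a left outer join, coalesce is the function that matches the nvl behaviour in the question.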

Spark / Scala: conditionally select columns from a DataFrame

2 answers: