How to get a column's data from two tables in Spark Scala

Asked: 2019-07-04 10:32:57

Tags: scala apache-spark

There are two tables, Customer1 and Customer2.

Customer1: lists the customers' details

https://docs.google.com/spreadsheets/d/1GuQaHhZ70D0NHGXuW51B5nNZXrSkthmEduHOhwoZmRg/edit#gid=722500260

Customer2: lists the customers' updated details

https://docs.google.com/spreadsheets/d/1GuQaHhZ70D0NHGXuW51B5nNZXrSkthmEduHOhwoZmRg/edit#gid=0

The CustomerName must be fetched from both tables: if a customer's name has been updated, it must come from the Customer2 table; otherwise it must come from the Customer1 table. All customer names should therefore be listed.
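The selection rule above can be sketched in plain Scala, without Spark. The maps below are hypothetical sample data (customerid to customername), mirroring the spreadsheets linked in the question; the point is only that the updated table wins whenever it has an entry:

```scala
// Hypothetical sample data: customerid -> customername.
val customer1 = Map(1 -> "shiva", 2 -> "Mani", 3 -> "Sneha")   // original details
val customer2 = Map(1 -> "shivamoorthy", 2 -> "Manikandan")    // updated details

// For every customer in customer1, prefer the updated name when one exists.
// This is the same rule the answer below expresses with a left join + coalesce.
val resolved = customer1.map { case (id, name) =>
  id -> customer2.getOrElse(id, name)
}
// 1 -> shivamoorthy, 2 -> Manikandan, 3 -> Sneha
```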

Expected result set:

https://docs.google.com/spreadsheets/d/1GuQaHhZ70D0NHGXuW51B5nNZXrSkthmEduHOhwoZmRg/edit#gid=1227228207

How can this be achieved in Spark Scala?

1 Answer:

Answer 0 (score: 0)

You can perform a left join of the customer1 table with the customer2 table, then use coalesce to pick the first non-null value for the customername column.

Example:

scala> val customer1 = Seq((1,"shiva","9994323565"),(2,"Mani","9994323567"),(3,"Sneha","9994323568")).toDF("customerid","customername","contact")
scala> val customer2 = Seq((1,"shivamoorthy","9994323565"),(2,"Manikandan","9994323567")).toDF("customerid","customername","contact")
scala> customer1.as("c1")
         .join(customer2.as("c2"), $"c1.customerid" === $"c2.customerid", "left")
         .selectExpr("c1.customerid",
           "coalesce(c2.customername, c1.customername) as customername")
         .show()

Result:

+----------+------------+
|customerid|customername|
+----------+------------+
|         1|shivamoorthy|
|         2|  Manikandan|
|         3|       Sneha|
+----------+------------+
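One caveat worth noting: a left join keeps only the ids present in customer1, so a customer that appears only in customer2 would be dropped. If that case can occur, a full outer join (join type "full" in Spark) with coalesce on the id as well covers it. The union semantics can be sketched in plain Scala; the data below is hypothetical, with customer 4 existing only in the updated table:

```scala
// Hypothetical data including a customer (4) present only in customer2.
val customer1 = Map(1 -> "shiva", 2 -> "Mani", 3 -> "Sneha")
val customer2 = Map(1 -> "shivamoorthy", 2 -> "Manikandan", 4 -> "Priya")

// Take the union of ids and prefer customer2's name, mirroring a full
// outer join followed by coalesce(c2.customername, c1.customername).
val allIds = customer1.keySet ++ customer2.keySet
val resolved = allIds.toSeq.sorted.map { id =>
  id -> customer2.getOrElse(id, customer1(id))
}
// 1 -> shivamoorthy, 2 -> Manikandan, 3 -> Sneha, 4 -> Priya
```

If customer2 is guaranteed to contain only ids already in customer1, the left join shown in the answer is sufficient and cheaper.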