How to add DataFrame contents in Scala while ignoring null values

Asked: 2018-04-10 15:03:45

Tags: scala apache-spark spark-dataframe

I have a dataframe in Scala as shown below. I got this result by doing a full outer join on two dataframes of different sizes.

These are the key-value pairs I get after executing the following query:

select * from TEMP1 a FULL OUTER JOIN TEMP2 b ON a.T_ROWKEY = b.N_ROWKEY

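The df below shows the key-value pairs. Where the keys match, the values need to be added together into a new dataframe; where a key has no match, its value should simply stay as it is.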

[2552195C312,100,2552195C312,5]
[null,null,175831A638,1]
[48061B887,1,null,null]
[null,null,171539C177,1]
[null,null,5584D2379,4]
[118732EE7792,3,null,null]
[null,null,8157FF1915,1]
[14310AA872,1000,14310AA872,7]
[148BB41539,5,148BB41539,1]
[40513SS68,1,null,null]
[null,null,199915UY72,11]
[11429401AW5,3,null,null]
[187755CD00,4,null,null]
[834413CV18,1,null,null]
[185475XS2,14,null,null]
[11716817SD8,2,null,null]
[2552998AS99,12,null,null]
[null,null,19792WS37,2]
[153054WE02,1,null,null]
[null,null,8131128ER1,7]

I am expecting a result like this:

[2552195C312,105]
[175831A638,1]
[48061B887,1]
[171539C177,1]
[5584D2379,4]
[118732EE7792,3]
[8157FF1915,1]
[14310AA872,1007]
[148BB41539,6]
[40513SS68,1]
[199915UY72,11]
[11429401AW5,3]
[187755CD00,4]
[834413CV18,1]
[185475XS2,14]
[11716817SD8,2]
[2552998AS99,12]
[19792WS37,2]
[153054WE02,1]
[8131128ER1,7]

Could someone please help? Thanks for your help.

1 answer:

Answer 0: (score: 1)

Since you haven't given the names of the value columns, I am assuming that the schema of the dataframe after the outer join is

root
 |-- T_ROWKEY: string (nullable = true)
 |-- T_ROWVALUE: integer (nullable = true)
 |-- N_ROWKEY: string (nullable = true)
 |-- N_ROWVALUE: integer (nullable = true)

So, with that schema, you should run the outer join and register its result as a temporary view:

sqlContext.sql("select * from TEMP1 a FULL OUTER JOIN TEMP2 b ON a.T_ROWKEY = b.N_ROWKEY").createOrReplaceTempView("JOINED")

Then a simple case when then else end should give you the final result you expect:

sqlContext.sql("select case when T_ROWKEY is null then `N_ROWKEY` else `T_ROWKEY` end as ROWKEY, case when T_ROWVALUE is null then 0 else `T_ROWVALUE` end  + case when N_ROWVALUE is null then 0 else `N_ROWVALUE` end as VALUE  from JOINED").show(false)

which should give you

+------------+-----+
|ROWKEY      |VALUE|
+------------+-----+
|14310AA872  |1007 |
|19792WS37   |2    |
|5584D2379   |4    |
|40513SS68   |1    |
|11716817SD8 |2    |
|11429401AW5 |3    |
|118732EE7792|3    |
|171539C177  |1    |
|187755CD00  |4    |
|8131128ER1  |7    |
|2552998AS99 |12   |
|834413CV18  |1    |
|8157FF1915  |1    |
|2552195C312 |105  |
|48061B887   |1    |
|148BB41539  |6    |
|153054WE02  |1    |
|175831A638  |1    |
|199915UY72  |11   |
|185475XS2   |14   |
+------------+-----+
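
As a shorter variant of the same query, the SQL function coalesce returns its first non-null argument, so it can stand in for each case when ... is null block. The following is only a sketch under assumptions: the local SparkSession, the app name and the two tiny sample tables are made up for illustration, and the column names are the ones assumed in this answer.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("coalesce-sql-sketch").getOrCreate()
import spark.implicits._

// Tiny stand-ins for TEMP1 and TEMP2 (hypothetical sample rows, assumed column names).
Seq(("2552195C312", 100), ("48061B887", 1))
  .toDF("T_ROWKEY", "T_ROWVALUE").createOrReplaceTempView("TEMP1")
Seq(("2552195C312", 5), ("175831A638", 1))
  .toDF("N_ROWKEY", "N_ROWVALUE").createOrReplaceTempView("TEMP2")

// coalesce picks whichever key is present and treats a missing value as 0 before adding.
val summed = spark.sql(
  """select coalesce(T_ROWKEY, N_ROWKEY) as ROWKEY,
    |       coalesce(T_ROWVALUE, 0) + coalesce(N_ROWVALUE, 0) as VALUE
    |from TEMP1 a FULL OUTER JOIN TEMP2 b ON a.T_ROWKEY = b.N_ROWKEY""".stripMargin)

summed.show(false)
```

The row order of the output is not guaranteed after a join, so add an order by if you need a stable listing.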

Using the API

Using the built-in when and otherwise functions is simpler and more concise:

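import org.apache.spark.sql.functions._
import sqlContext.implicits._  // enables the 'COLUMN symbol syntax outside spark-shell
val joined = sqlContext.table("JOINED")  // the temp view registered above
joined.select(when('T_ROWKEY.isNull, 'N_ROWKEY).otherwise('T_ROWKEY).as("ROWKEY"),
              (when('T_ROWVALUE.isNull, 0).otherwise('T_ROWVALUE) + when('N_ROWVALUE.isNull, 0).otherwise('N_ROWVALUE)).as("VALUE"))
  .show(false)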

which should give you the same result as above.
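
The same idea can also be written entirely with the DataFrame API, doing the join itself with join(..., "full_outer") and using the coalesce and lit functions in place of when/otherwise. Treat this as a minimal sketch: the session setup, the sample rows and the names temp1, temp2, joinedDf and totals are assumptions for illustration, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit}

// Local session for illustration only; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("coalesce-api-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows using the column names assumed in this answer.
val temp1 = Seq(("14310AA872", 1000), ("40513SS68", 1)).toDF("T_ROWKEY", "T_ROWVALUE")
val temp2 = Seq(("14310AA872", 7), ("19792WS37", 2)).toDF("N_ROWKEY", "N_ROWVALUE")

// Full outer join keeps keys that appear on only one side, with nulls for the other side.
val joinedDf = temp1.join(temp2, $"T_ROWKEY" === $"N_ROWKEY", "full_outer")

// Take whichever key is non-null, and count a missing value as 0 in the sum.
val totals = joinedDf.select(
  coalesce($"T_ROWKEY", $"N_ROWKEY").as("ROWKEY"),
  (coalesce($"T_ROWVALUE", lit(0)) + coalesce($"N_ROWVALUE", lit(0))).as("VALUE"))

totals.show(false)
```

Wrapping the sum in parentheses before calling .as("VALUE") makes it explicit that the alias applies to the whole expression rather than to the second operand.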

I hope the answer is helpful.