Adding missing categories to a DataFrame column

Time: 2019-04-15 05:47:00

Tags: scala apache-spark apache-spark-dataset

I have the following Spark DataFrame. The country column has 10 distinct values. I want to obtain the new DataFrame shown under the expected result.

DataFrame
+-------------+--------------+------------------+
|         Code|       country|                t1|
+-------------+--------------+------------------+
|            A|        Canada| 6218.400000000001|
|            A|       Central|              30.4|
|            A|        France|24540.629999999965|
|            A|       Germany|27688.029999999966|
|            A|     Northeast|             51.41|
|            A|     Northwest| 56261.31000000015|
|            A|     Southeast|             55.71|
|            A|     Southwest| 92640.42999999833|
|            A|United Kingdom|              0.64|
|            B|     Australia|145856.31999999806|
|            C|        Canada| 28223.26999999983|
|            C|     Northwest|              0.87|
|            C|     Southwest|              0.44|
+-------------+--------------+------------------+

Distinct values of the country column:
+--------------+
|       country|
+--------------+
|     Australia|
|        Canada|
|       Central|
|        France|
|       Germany|
|     Northeast|
|     Northwest|
|     Southeast|
|     Southwest|
|United Kingdom|
+--------------+

Expected result:

+-------------+--------------+------------------+
|         Code|       country|                t1|
+-------------+--------------+------------------+
|            A|     Australia|              null|
|            A|        Canada| 6218.400000000001|
|            A|       Central|              30.4|
|            A|        France|24540.629999999965|
|            A|       Germany|27688.029999999966|
|            A|     Northeast|             51.41|
|            A|     Northwest| 56261.31000000015|
|            A|     Southeast|             55.71|
|            A|     Southwest| 92640.42999999833|
|            A|United Kingdom|              0.64|
|            B|     Australia|145856.31999999806|
|            B|        Canada|              null|
|            B|       Central|              null|
|            B|        France|              null|
|            B|       Germany|              null|
|            B|     Northeast|              null|
|            B|     Northwest|              null|
|            B|     Southeast|              null|
|            B|     Southwest|              null|
|            B|United Kingdom|              null|
|            C|     Australia|              null|
|            C|        Canada| 28223.26999999983|
|            C|       Central|              null|
|            C|        France|              null|
|            C|       Germany|              null|
|            C|     Northeast|              null|
|            C|     Northwest|              0.87|
|            C|     Southeast|              null|
|            C|     Southwest|              0.44|
|            C|United Kingdom|              null|
+-------------+--------------+------------------+

How can I achieve this expected output in Scala? I have looked through the Dataset functions/methods but could not find anything to get me started.

  

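Please note that there may be multiple columns; the logic is the same in that case: I want to insert the missing categories against each category in all the columns.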

I am a beginner in Spark with Scala. Thanks in advance :)

1 answer:

Answer 0 (score: 1)

Cross join the distinct codes with the distinct countries, then left outer join that result with the original table, like this:

// Assumes `data` is the DataFrame above and `import spark.implicits._` is in scope for the $"..." syntax.
val codes = data.select($"Code").distinct
val countries = data.select($"country").distinct
val combinations = codes.crossJoin(countries)
// The left outer join keeps every (Code, country) pair; pairs absent from `data` get a null t1.
val result = combinations.join(data, Seq("Code", "country"), "left_outer").orderBy($"Code", $"country")
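For the multi-column case mentioned in the question, the same idea can probably be generalized. The following is a minimal, self-contained sketch rather than a definitive implementation: the object name `FillMissingCategories`, the helper `completeCombinations`, and the local `SparkSession` settings are all hypothetical names chosen for illustration, and the sample rows are a small subset of the question's data. It assumes `keyCols` is non-empty.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FillMissingCategories {

  // Hypothetical helper (not part of Spark): cross joins the distinct values
  // of every column in `keyCols`, then left outer joins the original rows so
  // that combinations absent from `df` appear with null in the other columns.
  // Assumes `keyCols` is non-empty.
  def completeCombinations(df: DataFrame, keyCols: Seq[String]): DataFrame = {
    val allCombinations = keyCols
      .map(c => df.select(c).distinct())
      .reduce((left, right) => left.crossJoin(right))

    allCombinations
      .join(df, keyCols, "left_outer")
      .orderBy(keyCols.head, keyCols.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; the settings are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-missing-categories")
      .getOrCreate()
    import spark.implicits._

    // A few rows taken from the question's DataFrame.
    val data = Seq(
      ("A", "Central", 30.4),
      ("A", "Northeast", 51.41),
      ("B", "Australia", 145856.31999999806),
      ("C", "Northwest", 0.87)
    ).toDF("Code", "country", "t1")

    // 3 codes x 4 countries = 12 rows; 8 of them have a null t1.
    completeCombinations(data, Seq("Code", "country")).show()

    spark.stop()
  }
}
```

Note that the cross join grows with the product of the distinct counts of the key columns, so this approach is best suited to low-cardinality category columns such as the ones in the question.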