How to create columns out of rows in a Spark DataFrame using Scala

Asked: 2018-08-30 19:33:49

Tags: scala apache-spark

I have an input DataFrame as shown below.

 +-------+----------------+------------+   
 |ID     |Title           |values      |    
 +-------+----------------+------------+  
 |ID-1   |First Name      |Jolly       |  
 |ID-1   |Middle Name     |Jr          |  
 |ID-1   |Last Name       |Hudson      |  
 |ID-2   |First Name      |Kathy       |  
 |ID-2   |Last Name       |Oliver      |  
 |ID-3   |Last Name       |Short       |  
 |ID-3   |Middle Name     |M           |  
 |ID-4   |First Name      |Denver      |  
 +-------+----------------+------------+   

I need the output to look like this:

 +-------+----------------+---------------+--------------+  
 |ID     |First Name      |Middle Name    | Last Name    |   
 +-------+----------------+---------------+--------------+ 
 |ID-1   |Jolly           |Jr             | Hudson       |      
 |ID-2   |Kathy           |null           | Oliver       | 
 |ID-3   |null            |M              | Short        |
 |ID-4   |Denver          |null           | null         |
 +-------+----------------+---------------+--------------+   

Please suggest a possible solution to achieve this result.
Thanks in advance.

1 Answer:

Answer 0 (score: 0)

Here's one approach: group the dataset by ID and pivot on Title, aggregating Values with first:

// Assumes a SparkSession in scope as `spark` (e.g. in spark-shell)
import org.apache.spark.sql.functions.first
import spark.implicits._  // enables toDF and the $"..." column syntax

val df = Seq(
  ("ID-1", "First Name", "Jolly"),
  ("ID-1", "Middle Name", "Jr"),
  ("ID-1", "Last Name", "Hudson"),
  ("ID-2", "First Name", "Kathy"),
  ("ID-2", "Last Name", "Oliver"),
  ("ID-3", "Last Name", "Short"),
  ("ID-3", "Middle Name", "M"),
  ("ID-4", "First Name", "Denver")
).toDF("ID", "Title", "Values")

df.
  groupBy("ID").pivot("Title").agg(first($"Values")).
  show(false)
// +----+----------+---------+-----------+
// |ID  |First Name|Last Name|Middle Name|
// +----+----------+---------+-----------+
// |ID-1|Jolly     |Hudson   |Jr         |
// |ID-3|null      |Short    |M          |
// |ID-4|Denver    |null     |null       |
// |ID-2|Kathy     |Oliver   |null       |
// +----+----------+---------+-----------+
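The pivoted columns above come out in sorted order and the rows in arbitrary order, so the shape differs slightly from the requested output. As a sketch building on the same df, passing the expected titles to pivot fixes the column order (and spares Spark the extra job that computes the distinct Title values), while orderBy fixes the row order:

// Listing the pivot values explicitly pins the column order and avoids
// a separate pass over the data to discover the distinct Titles.
val titles = Seq("First Name", "Middle Name", "Last Name")

df.
  groupBy("ID").pivot("Title", titles).agg(first($"Values")).
  orderBy("ID").
  show(false)
// +----+----------+-----------+---------+
// |ID  |First Name|Middle Name|Last Name|
// +----+----------+-----------+---------+
// |ID-1|Jolly     |Jr         |Hudson   |
// |ID-2|Kathy     |null       |Oliver   |
// |ID-3|null      |M          |Short    |
// |ID-4|Denver    |null       |null     |
// +----+----------+-----------+---------+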