Denormalizing/transposing a Spark DataFrame

Asked: 2016-09-28 06:47:03

Tags: scala apache-spark

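Given the following DataFrame, how can the data in rows be converted to columns? The list of property names is not fixed; that is, there may be more properties than the ones shown here. I am looking for a code example using Scala on Apache Spark.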

UserCode    | PropertyName          | PropertyValue 
1           | First Name            | Ram
1           | Last Name             | Shri 
1           | Address               | Ayodhya 
2           | First Name            | Laxman 
2           | Last Name             | Shri 
2           | Address               | Ayodhya
2           | Skill                 | Archery 
2           | Marital Status        | Married 
2           | Age                   | 23 
3           | First Name            | Sita 
3           | Last Name             | Devi
3           | Address               | Ayodhya

Expected output

UserCode    | First Name            | Last Name | Address | Skill   | Age
1           | Ram                   | Shri      | Ayodhya |         |       
2           | Laxman                | Shri      | Ayodhya | Archery | 23
3           | Sita                  | Devi      | Ayodhya |         |   

1 answer:

Answer 0 (score: 0)

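This is straightforward if you can use pivot (available on grouped DataFrames since Spark 1.6): group by userCode, pivot on propertyName, and take the first propertyValue for each cell.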

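// first() comes from functions. spark-shell already imports the implicits that toDF needs;
// in a standalone application, also add: import spark.implicits._
import org.apache.spark.sql.functions.first

val df = Seq(
  (1, "First Name", "Ram"),
  (1, "Last Name", "Shri"),
  (1, "Address", "Ayodhya"),
  (2, "First Name", "Laxman"),
  (2, "Last Name", "Shri"),
  (2, "Address", "Ayodhya"),
  (2, "Skill", "Archery"),
  (2, "Marital Status", "Married"),
  (2, "Age", "23"),
  (3, "First Name", "Sita"),
  (3, "Last Name", "Devi"),
  (3, "Address", "Ayodhya")
).toDF("userCode", "propertyName", "propertyValue")

// One row per userCode and one column per distinct propertyName;
// properties a user does not have come out as null.
df.groupBy("userCode").pivot("propertyName").agg(first("propertyValue")).show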

+--------+-------+----+----------+---------+--------------+-------+
|userCode|Address| Age|First Name|Last Name|Marital Status|  Skill|
+--------+-------+----+----------+---------+--------------+-------+
|       1|Ayodhya|null|       Ram|     Shri|          null|   null|
|       2|Ayodhya|  23|    Laxman|     Shri|       Married|Archery|
|       3|Ayodhya|null|      Sita|     Devi|          null|   null|
+--------+-------+----+----------+---------+--------------+-------+
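
The pivot above discovers the column names from the data and returns them sorted alphabetically, so the result includes Marital Status and does not follow the column order in the question. If you do happen to know which properties you want, pivot also accepts an explicit list of values, which fixes both the set and the order of the output columns and saves Spark the extra pass it otherwise needs to collect the distinct names. The following is only a sketch under that assumption: the object name PivotWithValues, the helper pivotProperties, and the local SparkSession settings are illustrative and not part of the original answer, and it assumes Spark 2.x.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.first

object PivotWithValues {

  // Hypothetical helper (not from the original answer): pivots only the
  // property names listed in `wanted`, in exactly that order.
  def pivotProperties(df: DataFrame, wanted: Seq[String]): DataFrame =
    df.groupBy("userCode")
      .pivot("propertyName", wanted)   // explicit values: no distinct scan
      .agg(first("propertyValue"))
      .orderBy("userCode")

  def main(args: Array[String]): Unit = {
    // Local session purely for illustration; a real job would configure this elsewhere.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "First Name", "Ram"),
      (1, "Last Name", "Shri"),
      (1, "Address", "Ayodhya"),
      (2, "First Name", "Laxman"),
      (2, "Last Name", "Shri"),
      (2, "Address", "Ayodhya"),
      (2, "Skill", "Archery"),
      (2, "Marital Status", "Married"),
      (2, "Age", "23"),
      (3, "First Name", "Sita"),
      (3, "Last Name", "Devi"),
      (3, "Address", "Ayodhya")
    ).toDF("userCode", "propertyName", "propertyValue")

    // Columns requested in the order used by the question's expected output.
    val wanted = Seq("First Name", "Last Name", "Address", "Skill", "Age")
    pivotProperties(df, wanted).show()

    spark.stop()
  }
}
```

Properties that are not in the list (here Marital Status) are simply left out of the result, so this variant only fits when the set of interesting properties is known up front. When the names are genuinely open-ended, as the question states, the plain pivot("propertyName") call from the answer remains the simpler choice.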