Pivoting user attribute rows into columns with a Spark DataFrame

Date: 2017-12-28 11:26:52

Tags: scala apache-spark spark-dataframe

My DataFrame looks like this:

+------+------------+------------------+
|UserID|Attribute   | Value            |
+------+------------+------------------+
|123   |  City      | San Francisco    |
|123   |  Lang      | English          |
|111   |  Lang      | French           |
|111   |  Age       | 23               |
|111   |  Gender    | Female           |
+------+------------+------------------+

So I have a number of different attributes, some of which can be null for certain users (the set of attributes is limited, say at most 20).

I want to transform this DF into:

+-----+--------------+---------+-----+--------+
|User |City          | Lang    | Age | Gender |
+-----+--------------+---------+-----+--------+
|123  |San Francisco | English | NULL| NULL   |
|111  |          NULL| French  | 23  | Female |
+-----+--------------+---------+-----+--------+

I am very new to Spark and Scala.

1 Answer:

Answer 0 (score: 2)

You can use pivot to get the desired output:

import org.apache.spark.sql.functions._

// Group by user, turn each distinct Attribute into a column,
// and take the first Value for each user/attribute pair.
df.groupBy("UserID")
  .pivot("Attribute")
  .agg(first("Value"))
  .show()

This gives you the desired output:

+------+----+-------------+------+-------+
|UserID| Age|         City|Gender|   Lang|
+------+----+-------------+------+-------+
|   111|  23|         null|Female| French|
|   123|null|San Francisco|  null|English|
+------+----+-------------+------+-------+
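Note that the result above keeps the `UserID` column name and orders the pivoted columns alphabetically, while the question asked for a `User` column and a specific order. A minimal sketch of how to close that gap, assuming the same `df` as above; listing the pivot values explicitly also saves Spark an extra pass over the data to discover the distinct attributes (the `Seq` below covers only the four attributes shown and would need to be extended to your real attribute set):

```scala
import org.apache.spark.sql.functions._

val pivoted = df.groupBy("UserID")
  // Explicit pivot values: skips the distinct-value scan and fixes column order.
  .pivot("Attribute", Seq("City", "Lang", "Age", "Gender"))
  .agg(first("Value"))
  // Rename and reorder to match the requested output.
  .withColumnRenamed("UserID", "User")
  .select("User", "City", "Lang", "Age", "Gender")

pivoted.show()
```

Any attribute a user lacks simply comes out as `null` in its column, which matches the `NULL` cells in the desired table.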