My DataFrame looks like this:
+------+---------+-------------+
|UserID|Attribute|Value        |
+------+---------+-------------+
|123   |City     |San Francisco|
|123   |Lang     |English      |
|111   |Lang     |French       |
|111   |Age      |23           |
|111   |Gender   |Female       |
+------+---------+-------------+
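So I have a number of different attributes, any of which may be missing for a given user (the set of attributes is bounded, say at most 20).
I want to convert this DataFrame to: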
+----+-------------+-------+----+------+
|User|City         |Lang   |Age |Gender|
+----+-------------+-------+----+------+
|123 |San Francisco|English|NULL|NULL  |
|111 |NULL         |French |23  |Female|
+----+-------------+-------+----+------+
I am very new to Spark and Scala.
Answer 0 (score: 2)
You can use pivot to get the desired output:
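import org.apache.spark.sql.functions._
// sparkSession is assumed to be your existing SparkSession instance
import sparkSession.sqlContext.implicits._

// Group by user, turn each distinct Attribute into its own column,
// and keep the first Value found for each (UserID, Attribute) pair.
df.groupBy("UserID")
  .pivot("Attribute")
  .agg(first("Value"))
  .show()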
This gives you the desired output, with null wherever a user has no value for an attribute:
+------+----+-------------+------+-------+
|UserID| Age| City|Gender| Lang|
+------+----+-------------+------+-------+
| 111| 23| null|Female| French|
| 123|null|San Francisco| null|English|
+------+----+-------------+------+-------+
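Since the question says the set of attributes is bounded, it may be worth passing the attribute names to pivot explicitly. The overload pivot(pivotColumn, values) saves Spark from first computing the distinct values of Attribute, and it fixes the column order. The following is only a sketch: the object name PivotSketch, the helper pivotAttributes, and the knownAttributes list are made up for illustration, and the list assumes you know the attribute names up front.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.first

object PivotSketch {
  // Hypothetical list of attribute names; replace with the ones in your data.
  val knownAttributes: Seq[String] = Seq("City", "Lang", "Age", "Gender")

  // Pivot only on the listed attributes; any attribute not listed is dropped,
  // and a listed attribute that a user lacks comes out as null.
  def pivotAttributes(df: DataFrame): DataFrame =
    df.groupBy("UserID")
      .pivot("Attribute", knownAttributes)
      .agg(first("Value"))

  def main(args: Array[String]): Unit = {
    // Local session for trying the sketch out; in a real job reuse your own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      ("123", "City", "San Francisco"),
      ("123", "Lang", "English"),
      ("111", "Lang", "French"),
      ("111", "Age", "23"),
      ("111", "Gender", "Female")
    ).toDF("UserID", "Attribute", "Value")

    // Columns appear as UserID, City, Lang, Age, Gender; row order is not guaranteed.
    pivotAttributes(df).show()

    spark.stop()
  }
}
```

If the attribute names are not known in advance, the plain pivot("Attribute") call from the answer is the simpler choice, at the cost of an extra pass over the data to discover them.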