I ran into a problem while reshaping data with Spark. The original data looks like this:
df = sqlContext.createDataFrame([
("ID_1", "VAR_1", "Butter"),
("ID_1", "VAR_2", "Toast"),
("ID_1", "VAR_3", "Ham"),
("ID_2", "VAR_1", "Jam"),
("ID_2", "VAR_2", "Toast"),
("ID_2", "VAR_3", "Egg"),
], ["ID", "VAR", "VAL"])
>>> df.show()
+----+-----+------+
| ID| VAR| VAL|
+----+-----+------+
|ID_1|VAR_1|Butter|
|ID_1|VAR_2| Toast|
|ID_1|VAR_3| Ham|
|ID_2|VAR_1| Jam|
|ID_2|VAR_2| Toast|
|ID_2|VAR_3| Egg|
+----+-----+------+
This is the structure I'm trying to achieve:
+----+------+-----+-----+
| ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast| Ham|
|ID_2| Jam|Toast| Egg|
+----+------+-----+-----+
My idea was to use:
df.groupBy("ID").pivot("VAR").show()
But I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'
Any suggestions? Thanks!
Answer 0 (score: 1):
You need to add an aggregation after pivot(). If you are sure there is only one "VAL" per ("ID", "VAR") pair, you can use first():
from pyspark.sql import functions as f
result = df.groupBy("ID").pivot("VAR").agg(f.first("VAL"))
result.show()
+----+------+-----+-----+
| ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast| Ham|
|ID_2| Jam|Toast| Egg|
+----+------+-----+-----+
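As a follow-up note (a sketch, not part of the original answer): groupBy() and pivot() both return a GroupedData object, which is why calling .show() on it failed; only the aggregation turns it back into a DataFrame. If you already know the distinct values of the pivot column, you can also pass them to pivot() explicitly, which lets Spark skip the extra pass over the data that it would otherwise need to collect them. The column names below are taken from the question:

from pyspark.sql import functions as f

# Passing the pivot values explicitly avoids a separate job that
# scans the data just to find the distinct values of "VAR".
result = (
    df.groupBy("ID")
      .pivot("VAR", ["VAR_1", "VAR_2", "VAR_3"])
      .agg(f.first("VAL"))
)
result.show()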