Spark Pivot Strings in PySpark

Asked: 2016-11-05 20:31:35

Tags: apache-spark pivot pyspark

I'm running into a problem restructuring data with Spark. The original data looks like this:

df = sqlContext.createDataFrame([
    ("ID_1", "VAR_1", "Butter"),
    ("ID_1", "VAR_2", "Toast"),
    ("ID_1", "VAR_3", "Ham"),
    ("ID_2", "VAR_1", "Jam"),
    ("ID_2", "VAR_2", "Toast"),
    ("ID_2", "VAR_3", "Egg"),
], ["ID", "VAR", "VAL"])

>>> df.show()
+----+-----+------+
|  ID|  VAR|   VAL|
+----+-----+------+
|ID_1|VAR_1|Butter|
|ID_1|VAR_2| Toast|
|ID_1|VAR_3|   Ham|
|ID_2|VAR_1|   Jam|
|ID_2|VAR_2| Toast|
|ID_2|VAR_3|   Egg|
+----+-----+------+

This is the structure I'm trying to achieve:

+----+------+-----+-----+
|  ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast|  Ham|
|ID_2|   Jam|Toast|  Egg|
+----+------+-----+-----+

My idea was to use:

df.groupBy("ID").pivot("VAR").show()

But I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'

Any suggestions? Thanks!

1 Answer:

Answer 0 (score: 1):

You need to add an aggregation after pivot(). If you are sure there is only one "VAL" per ("ID", "VAR") pair, you can use first():

from pyspark.sql import functions as f

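# first() picks the single VAL in each (ID, VAR) group; any other aggregate would work here too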
result = df.groupBy("ID").pivot("VAR").agg(f.first("VAL"))
result.show()

+----+------+-----+-----+
|  ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast|  Ham|
|ID_2|   Jam|Toast|  Egg|
+----+------+-----+-----+
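Worth noting: when pivot() is called without a list of values, Spark first runs a separate job just to collect the distinct values of the pivot column. If the values are known up front, as they are here, passing them explicitly skips that pass. A minimal sketch with the same df:

from pyspark.sql import functions as f

# Listing the pivot values up front spares Spark the extra job of
# computing the distinct values of "VAR" itself.
result = df.groupBy("ID").pivot("VAR", ["VAR_1", "VAR_2", "VAR_3"]).agg(f.first("VAL"))
result.show()

And if a ("ID", "VAR") pair could hold more than one "VAL", an aggregate such as f.collect_list("VAL") would keep all of them in an array rather than silently discarding everything after the first.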