I have a DataFrame:
student_id class score
1 A 6
1 B 7
1 C 8
I want to split the score by class into 3 columns, so the DataFrame above should become:
student_id class_A_score class_B_score class_C_score
1 6 7 8
The idea is to turn A, B, and C into 3 columns.
Answer 0 (score: 1)
This is a classic example of pivoting. In PySpark, if df is your DataFrame:
new_df = df.groupBy(['student_id']).pivot('class').sum('score')
Databricks has a good explanation of this at https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
Answer 1 (score: 1)
values = [(1,'A',6),(1,'B',7),(1,'C',8)]
df = sqlContext.createDataFrame(values,['student_id','class','score'])
df.show()
+----------+-----+-----+
|student_id|class|score|
+----------+-----+-----+
|         1|    A|    6|
|         1|    B|    7|
|         1|    C|    8|
+----------+-----+-----+
df = df.groupBy(["student_id"]).pivot("class").sum("score")
df.show()
+----------+---+---+---+
|student_id|  A|  B|  C|
+----------+---+---+---+
|         1|  6|  7|  8|
+----------+---+---+---+
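Note that the pivoted columns come out named A, B, and C, not class_A_score, class_B_score, and class_C_score as the question asked. As a rough, Spark-free illustration of what the groupBy / pivot / sum chain computes, and of how the requested names could be built, here is a minimal pure-Python sketch. The helper name pivot_sum and its parameters are hypothetical, made up for this example. It is not how Spark implements pivoting, only a small model of the result.

```python
from collections import defaultdict


def pivot_sum(rows, key, pivot, value):
    """Hypothetical helper: group rows by `key`, turn each distinct
    `pivot` value into its own column, and sum `value` per cell."""
    # Distinct pivot values become the new columns; sorted for a stable order.
    pivot_values = sorted({row[pivot] for row in rows})

    # totals[key_value][pivot_value] -> running sum of `value`
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        totals[row[key]][row[pivot]] += row[value]

    result = []
    for key_value, sums in totals.items():
        out = {key: key_value}
        for p in pivot_values:
            # Name columns as the question asks, e.g. class_A_score.
            # None marks a (key, pivot) combination with no input rows.
            out[f"{pivot}_{p}_{value}"] = sums.get(p)
        result.append(out)
    return result


rows = [
    {"student_id": 1, "class": "A", "score": 6},
    {"student_id": 1, "class": "B", "score": 7},
    {"student_id": 1, "class": "C", "score": 8},
]

print(pivot_sum(rows, "student_id", "class", "score"))
# [{'student_id': 1, 'class_A_score': 6, 'class_B_score': 7, 'class_C_score': 8}]
```

In PySpark itself, the same naming could presumably be reached by renaming the pivoted columns after the pivot, for example with withColumnRenamed.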