Question

I have a DataFrame:

student_id class score
1 A 6
1 B 7
1 C 8

I want to split the class scores into 3 columns, so the DataFrame above should become:

student_id class_A_score class_B_score class_C_score
1 6 7 8

The idea is to turn A, B, and C into 3 columns.

Answer 1

This is a classic example of pivoting. In PySpark, if df is your DataFrame:

new_df = df.groupBy(['student_id']).pivot('class').sum('score')

Databricks has a good explanation of this at https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
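To make the semantics concrete, here is a rough plain-Python sketch of what a group-by, pivot, and sum computes. It is only an illustration and not how Spark implements it; the function name pivot_sum and the list-of-dicts row format are assumptions made for this example.

```python
from collections import defaultdict


def pivot_sum(rows, key, pivot_col, value_col):
    """Group rows by `key`, spread `pivot_col` values into columns, sum `value_col`."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[key]][row[pivot_col]] += row[value_col]
    # Sort the distinct pivot values so the output column order is stable.
    columns = sorted({row[pivot_col] for row in rows})
    # A combination that never occurs is left as None, similar to a null cell.
    return [
        {key: k, **{c: cells.get(c) for c in columns}}
        for k, cells in table.items()
    ]


rows = [
    {"student_id": 1, "class": "A", "score": 6},
    {"student_id": 1, "class": "B", "score": 7},
    {"student_id": 1, "class": "C", "score": 8},
]
print(pivot_sum(rows, "student_id", "class", "score"))
# [{'student_id': 1, 'A': 6, 'B': 7, 'C': 8}]
```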

Answer 2

values = [(1,'A',6),(1,'B',7),(1,'C',8)]
df = sqlContext.createDataFrame(values,['student_id','class','score'])
df.show()
+----------+-----+-----+
|student_id|class|score|
+----------+-----+-----+
|         1|    A|    6|
|         1|    B|    7|
|         1|    C|    8|
+----------+-----+-----+
df = df.groupBy(["student_id"]).pivot("class").sum("score")
df.show()
+----------+---+---+---+
|student_id|  A|  B|  C|
+----------+---+---+---+
|         1|  6|  7|  8|
+----------+---+---+---+
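
The pivoted columns above are named A, B, and C, while the question asks for class_A_score, class_B_score, and class_C_score. A minimal sketch of one way to rename them, assuming the pivoted DataFrame keeps student_id as its key column; the helper names score_column_names and pivot_scores are hypothetical and were chosen for this example:

```python
def score_column_names(columns, key="student_id"):
    """Map pivoted names such as 'A' to 'class_A_score'; leave the key column alone."""
    return [c if c == key else f"class_{c}_score" for c in columns]


def pivot_scores(df, key="student_id"):
    """Pivot a PySpark DataFrame with (student_id, class, score) columns and rename the result."""
    pivoted = df.groupBy(key).pivot("class").sum("score")
    # DataFrame.toDF(*names) replaces all column names positionally.
    return pivoted.toDF(*score_column_names(pivoted.columns, key))


print(score_column_names(["student_id", "A", "B", "C"]))
# ['student_id', 'class_A_score', 'class_B_score', 'class_C_score']
```

With the df from this answer, pivot_scores(df).show() should then display the column names requested in the question.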

Pivot in PySpark
