Question

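I have a DataFrame with only two columns. I am trying to turn the values of one column into column headers and the values of the other column into the row beneath them. I have tried pivot and everything else I could think of, but nothing works.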

df_pivot_test = sc.parallelize([('a',1), ('b',1), ('c',2), ('d',2), ('e',10)]).toDF(["id","score"])

id  score
a   1
b   1
c   2
d   2
e   10

I am trying to convert it to

a   b   c   d   e
1   1   2   2   10

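Any ideas on how to do this? I know it can be done by converting to a pandas DataFrame, but I don't want to use .toPandas(): we have billions of rows, so we would run into memory issues.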

Answer 1

You can combine groupBy and pivot to get the desired result.

Try this approach:

from pyspark.sql.functions import expr, lit

# With a literal value in the groupBy clause; lit(1) adds a grouping column named "1", which is dropped afterwards
df_pivot_test.groupBy(lit(1)).pivot("id").agg(expr("first(score)")).drop("1").show()

(or)

# without any column in groupby clause
df_pivot_test.groupBy().pivot("id").agg(expr("first(score)")).show()

Result:

+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
|  1|  1|  2|  2| 10|
+---+---+---+---+---+
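To make it clearer what the pivot above computes, here is a rough plain-Python model of it (no Spark needed). It is only a sketch: `pivot_first` is a hypothetical helper written for illustration, not a Spark API. It assumes, as Spark does when no explicit pivot values are passed, that the distinct ids become the columns in sorted order, and that first(score) keeps one score per id.

```python
# Plain-Python model of groupBy().pivot("id").agg(first("score")).
# pivot_first is a hypothetical name used for illustration only.

def pivot_first(rows):
    """Return (header, values): one column per distinct id, sorted by id."""
    first_seen = {}
    for key, score in rows:
        # Keep only the first score encountered for each id, like first(score).
        first_seen.setdefault(key, score)
    header = sorted(first_seen)                # distinct ids become the columns
    values = [first_seen[k] for k in header]   # the single output row
    return header, values

rows = [("a", 1), ("b", 1), ("c", 2), ("d", 2), ("e", 10)]
header, values = pivot_first(rows)
print(header)  # ['a', 'b', 'c', 'd', 'e']
print(values)  # [1, 1, 2, 2, 10]
```

One thing the model hides: if an id appears more than once, Spark gives no guarantee about row order, so first(score) may return any of that id's scores. When that matters, an aggregate such as min, max or sum is probably a safer choice.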
