Question

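I'm trying to add several new columns to my DataFrame (preferably in a for loop), where each new column is the count of certain values of column B after grouping by column A.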

What doesn't work:

from pyspark.sql import functions as f
#the first one will be fine
df_grouped=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_grouped.show()
+---+-----+
| A |count|
+---+-----+
|859|    4|
|947|    2|
|282|    6|
|699|   24|
|153|   12|

# create the second column:
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count() 
df_g2.show()
+---+-----+
| A |count|
+---+-----+
|174|   18|
|153|   20|
|630|    6|
|147|   16|

#I get an error on adding the new column:
df_grouped=df_grouped.withColumn('2nd_count',f.col(df_g2.select('count')))

Error:

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

I also tried it without f.col, using just df_g2.count, but then I get an error saying "col should be Column".

What works:

df_g1=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_grouped=df_g1.join(df_g2,['A'])

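However, I need to add around 1000 new columns, and that many joins seems expensive. I'd like to know whether a join is unavoidable, given that the order of column A changes in the grouped object every time I group by it (for example, compare the order of column A in df_grouped with its order in df_g2 above), or whether there is a better way.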

Answer 1

What you probably need is groupby and pivot. Try this:

from pyspark.sql import functions as F

df.groupby('A').pivot('B').agg(F.count('B')).show()

Create multiple columns for a grouped PySpark DataFrame
