Question

I am trying to apply different aggregation functions to different columns of a PySpark DataFrame. Following some suggestions on Stack Overflow, I tried this:

the_columns = ["product1","product2"]
the_columns2 = ["customer1","customer2"]

exprs = [mean(col(d)) for d in the_columns, count(col(c)) for c in the_columns2]

followed by

 df.groupby(*group).agg(*exprs)

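where "group" is a column that appears in neither the_columns nor the_columns2. This does not work. How can I apply different aggregation functions to different columns?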

Answer 1

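You are very close. Instead of putting both comprehensions inside a single pair of brackets, build two lists and concatenate them so that you end up with one flat list of expressions: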

exprs = [mean(col(d)) for d in the_columns] + [count(col(c)) for c in the_columns2]
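
The fix itself is ordinary Python list concatenation and does not depend on Spark. Python 3 rejects two "for" clauses separated by a comma inside one pair of brackets as a syntax error, whereas two separate comprehensions joined with + are valid. The sketch below illustrates this with hypothetical stand-in functions (fake_mean and fake_count are not part of PySpark; they return strings in place of Column objects so the snippet runs without a Spark session):

```python
from itertools import chain

# Hypothetical stand-ins for mean(col(...)) and count(col(...));
# they return strings so the example runs without Spark.
def fake_mean(name):
    return f"avg({name})"

def fake_count(name):
    return f"count({name})"

the_columns = ["product1", "product2"]
the_columns2 = ["customer1", "customer2"]

# Two comprehensions joined with + produce one flat list.
exprs = [fake_mean(d) for d in the_columns] + [fake_count(c) for c in the_columns2]

# Equivalent alternative: chain two generator expressions and materialise the result.
exprs_chained = list(chain(
    (fake_mean(d) for d in the_columns),
    (fake_count(c) for c in the_columns2),
))

print(exprs)
```

Because the result is a flat list, it can be unpacked with * into a call that takes one argument per expression, which is what agg(*exprs) relies on.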

Here is a demo:

import pyspark.sql.functions as F

df.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  1|  2|  1|
|  1|  2|  2|  2|
|  2|  3|  3|  3|
|  2|  4|  3|  4|
+---+---+---+---+

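cols = ['b']          # columns to average
cols2 = ['c', 'd']    # columns to count

# Concatenate the two comprehensions into one flat list of Column expressions
exprs = [F.mean(F.col(x)) for x in cols] + [F.count(F.col(x)) for x in cols2]

df.groupBy('a').agg(*exprs).show()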
+---+------+--------+--------+
|  a|avg(b)|count(c)|count(d)|
+---+------+--------+--------+
|  1|   1.5|       2|       2|
|  2|   3.5|       2|       2|
+---+------+--------+--------+
