I am trying to compute a weighted mean in pyspark, but I'm not making much progress.
# Example data
df = sc.parallelize([
    ("a", 7, 1), ("a", 5, 2), ("a", 4, 3),
    ("b", 2, 2), ("b", 5, 4), ("c", 1, -1)
]).toDF(["k", "v1", "v2"])
df.show()
import numpy as np
import pyspark.sql.functions
import pyspark.sql.types

def weighted_mean(workclass, final_weight):
    return np.average(workclass, weights=final_weight)

weighted_mean_udaf = pyspark.sql.functions.udf(weighted_mean,
                                               pyspark.sql.types.IntegerType())
But when I try to execute this code:
df.groupby('k').agg(weighted_mean_udaf(df.v1,df.v2)).show()
I get the error:
u"expression 'pythonUDF' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get
My question is: can I pass a custom function (with multiple arguments) as an argument to agg? If not, is there any alternative that lets me compute something like a weighted mean after grouping by key?
Answer 0 (score: 3)
A user-defined aggregate function (UDAF, which operates on pyspark.sql.GroupedData but is not supported in pyspark) is not the same thing as a user-defined function (UDF, which operates on pyspark.sql.DataFrame).
Because you cannot create your own UDAF in pyspark, and none of the provided aggregate functions solves your problem, you may need to go back to the RDD world:
def weighted_mean(vals):
    vals = list(vals)  # materialize the iterator so it can be traversed twice
    sum_of_weights = sum(tup[1] for tup in vals)
    return sum(1. * tup[0] * tup[1] / sum_of_weights for tup in vals)

df.rdd.map(  # go through df.rdd, since map is defined on RDDs, not on DataFrames
    lambda x: (x[0], tuple(x[1:]))  # reshape each Row to (key, (v1, v2)) so grouping works
).groupByKey().mapValues(
    weighted_mean
).collect()
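As a side note, not part of the original answer: for a weighted mean specifically you can also stay in the DataFrame API, since it is just sum(v1 * v2) / sum(v2) expressed with built-in aggregates; a minimal sketch:

from pyspark.sql import functions as F

df.groupby('k').agg(
    (F.sum(df.v1 * df.v2) / F.sum(df.v2)).alias('weighted_mean')
).show()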