Weighted median on a PySpark DataFrame

Asked: 2019-12-17 13:12:01

Tags: python pyspark pyspark-dataframes

To calculate the row-wise weighted median, I have written the code below, but the resulting values are all null. Where am I going wrong? col_A holds the values and col_B holds the weights associated with those values.

Code:

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

def get_median(values, weights):
    return np.median(np.repeat(values, weights))    # function created to calculate the weighted median

wimedian = F.udf(get_median, DoubleType())    # registering as a UDF here

myview = df.groupBy('category').agg(
    F.collect_list(F.col('col_A')),
    F.collect_list(F.col('col_B'))
).withColumn('Weighted_median', wimedian(F.col('col_A'), F.col('col_B')))

myview.show(3)

Output table:

+-----------+--------+-------+---------------+
|category   |col_A   |col_B  |Weighted_median|
+-----------+--------+-------+---------------+
|001        |[69]    |[8]    |null           |
|002        |[69]    |[14]   |null           |
|003        |[28, 21]|[3, 1] |null           |
+-----------+--------+-------+---------------+

FYI, the correct output for the third row of this table should be the median of [28, 28, 28, 21] = 28. That is why I use np.median with np.repeat.
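
As a quick sanity check of that repeat-then-median logic, here is a minimal NumPy-only sketch (no Spark involved; the variable names are just for illustration):

import numpy as np

values, weights = [28, 21], [3, 1]
expanded = np.repeat(values, weights)   # array([28, 28, 28, 21])
print(np.median(expanded))              # 28.0 — matches the expected result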

1 Answer:

Answer 0 (score: 1)

The problem seems to be the return type: Spark cannot handle NumPy types, so the DoubleType UDF silently produces null when get_median returns a numpy.float64. The column references in the withColumn statement are also incorrect — after the agg, the columns are named collect_list(col_A) and collect_list(col_B), not col_A and col_B.
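
A quick way to see the first problem interactively (a minimal check, assuming numpy is imported as np):

print(type(np.median(np.repeat([28, 21], [3, 1]))))
# <class 'numpy.float64'> — not a Python float, so the DoubleType UDF yields null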

I cast the result to a Python float and it runs:

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

def get_median(values, weights):
    # cast numpy.float64 to a plain Python float so DoubleType can serialize it
    return float(np.median(np.repeat(values, weights)))

wimedian = F.udf(get_median, DoubleType())
df = sc.parallelize([["001", 69, 8], ["002", 69, 14],
                     ["003", 28, 3], ["003", 21, 1]]).toDF(["category", "col_A", "col_B"])

myview = df.groupBy('category').agg(
    F.collect_list(F.col('col_A')),
    F.collect_list(F.col('col_B'))
).withColumn('Weighted_median',
             wimedian(F.col("collect_list(col_A)"), F.col("collect_list(col_B)"))
).show()
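
As a small readability variation (my own addition, not part of the original answer), aliasing the aggregated lists avoids the generated collect_list(col_A) column names:

myview = df.groupBy('category').agg(
    F.collect_list('col_A').alias('col_A'),
    F.collect_list('col_B').alias('col_B')
).withColumn('Weighted_median', wimedian(F.col('col_A'), F.col('col_B')))

myview.show(3)

With the sample data above, the category 003 row shows Weighted_median = 28.0, matching the hand-computed median of [28, 28, 28, 21].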