PySpark: divide a column by its sum

Asked: 2018-05-23 16:37:28

Tags: python apache-spark pyspark

I am trying to divide columns in PySpark by their respective sums. My dataframe (only one column is used here) looks like this:

event_rates = [[1,10.461016949152542], [2, 10.38953488372093], [3, 10.609418282548477]]
event_rates = spark.createDataFrame(event_rates, ['cluster_id','mean_encoded'])
event_rates.show()

+----------+------------------+
|cluster_id|      mean_encoded|
+----------+------------------+
|         1|10.461016949152542|
|         2| 10.38953488372093|
|         3|10.609418282548477|
+----------+------------------+

I tried two ways to do this, but failed to get a result.

from pyspark.sql.functions import sum as spark_sum
cols = event_rates.columns[1:]
for each in cols:
    event_rates = event_rates.withColumn(each+"_scaled", event_rates[each]/spark_sum(event_rates[each]))

This gives me the following error:

org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`cluster_id`' is not an aggregate function. Wrap '((`mean_encoded` / sum(`mean_encoded`)) AS `mean_encoded_scaled`)' in windowing function(s) or wrap '`cluster_id`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [cluster_id#22356L, mean_encoded#22357, (mean_encoded#22357 / sum(mean_encoded#22357)) AS mean_encoded_scaled#2

And following the question here, I tried the following:

stats = (event_rates.agg([spark_sum(x).alias(x + '_sum') for x in cols]))
event_rates = event_rates.join(broadcast(stats))
exprs = [event_rates[x] / event_rates[event_rates + '_sum'] for x in cols]
event_rates.select(exprs)

But from the first line I get the error

AssertionError: all exprs should be Column

How do I get around this?
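
For reference, the windowing workaround that the first error message points at could look roughly like the sketch below; the unpartitioned Window.partitionBy() is an assumption here, not something taken from the post:

from pyspark.sql import Window
from pyspark.sql import functions as F

# one window spanning the whole frame, so sum() runs over every row
# (Spark warns that this moves all rows to a single partition)
w = Window.partitionBy()
scaled = event_rates.withColumn(
    "mean_encoded_scaled",
    F.col("mean_encoded") / F.sum("mean_encoded").over(w))
scaled.show()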

1 answer:

Answer 0 (score: 0)

Try this,

from pyspark.sql import functions as F
total = event_rates.groupBy().agg(F.sum("mean_encoded"),F.sum("cluster_id")).collect()
total

The answer is,

[Row(sum(mean_encoded)=31.459970115421946, sum(cluster_id)=6)]
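
The collected Row only holds the column totals; the scaled column still has to be built by dividing through. A minimal sketch continuing from the snippet above (looking the total up by the generated field name sum(mean_encoded) and the column name mean_encoded_scaled are my additions, not part of the original answer):

from pyspark.sql import functions as F

# pull the total for mean_encoded out of the collected Row ...
sum_mean_encoded = total[0]["sum(mean_encoded)"]

# ... and divide the column by it as a literal
event_rates = event_rates.withColumn(
    "mean_encoded_scaled",
    F.col("mean_encoded") / F.lit(sum_mean_encoded))
event_rates.show()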