I am trying to divide columns in PySpark by their respective sums. My dataframe (using only one column here) looks like this:
event_rates = [[1,10.461016949152542], [2, 10.38953488372093], [3, 10.609418282548477]]
event_rates = spark.createDataFrame(event_rates, ['cluster_id','mean_encoded'])
event_rates.show()
+----------+------------------+
|cluster_id| mean_encoded|
+----------+------------------+
| 1|10.461016949152542|
| 2| 10.38953488372093|
| 3|10.609418282548477|
+----------+------------------+
I tried two ways to do this, but couldn't get a result.
from pyspark.sql.functions import sum as spark_sum
cols = event_rates.columns[1:]
for each in cols:
    event_rates = event_rates.withColumn(each + "_scaled", event_rates[each] / spark_sum(event_rates[each]))
This gives me the following error:
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`cluster_id`' is not an aggregate function. Wrap '((`mean_encoded` / sum(`mean_encoded`)) AS `mean_encoded_scaled`)' in windowing function(s) or wrap '`cluster_id`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [cluster_id#22356L, mean_encoded#22357, (mean_encoded#22357 / sum(mean_encoded#22357)) AS mean_encoded_scaled#2
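(If I read the message right, it suggests wrapping the sum in a window function spanning the whole frame. A rough sketch of what I think it means, though I am not sure this is the intended approach:)
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty partitionBy() makes the window cover the entire dataframe,
# so the sum becomes the column total repeated on every row.
w = Window.partitionBy()
event_rates = event_rates.withColumn(
    "mean_encoded_scaled",
    F.col("mean_encoded") / F.sum("mean_encoded").over(w))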
Following the question here, I tried the following:
stats = (event_rates.agg([spark_sum(x).alias(x + '_sum') for x in cols]))
event_rates = event_rates.join(broadcast(stats))
exprs = [event_rates[x] / event_rates[event_rates + '_sum'] for x in cols]
event_rates.select(exprs)
but I get an error from the first statement:
AssertionError: all exprs should be Column
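(I suspect the AssertionError comes from passing a Python list to agg instead of unpacking it into Column arguments; a sketch of what I think the corrected join approach would look like, though I am not sure about the x + '_sum' lookup either:)
from pyspark.sql.functions import broadcast, sum as spark_sum

# Unpack the list so agg receives Column arguments, not a single list
stats = event_rates.agg(*[spark_sum(x).alias(x + '_sum') for x in cols])
# crossJoin attaches the single-row totals to every row
event_rates = event_rates.crossJoin(broadcast(stats))
exprs = [(event_rates[x] / event_rates[x + '_sum']).alias(x + '_scaled') for x in cols]
event_rates.select('cluster_id', *exprs).show()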
How do I solve this?
Answer 0 (score: 0)
Try this,
from pyspark.sql import functions as F
total = event_rates.groupBy().agg(F.sum("mean_encoded"),F.sum("cluster_id")).collect()
total
The answer is,
[Row(sum(mean_encoded)=31.459970115421946, sum(cluster_id)=6)]
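This only returns the grand totals, so as a rough follow-up sketch (assuming the column names above), the collected sum can then be used to build the scaled column:
from pyspark.sql import functions as F

# total is a list with a single Row; pull out the sum of mean_encoded by field name
total_mean_encoded = total[0]['sum(mean_encoded)']
event_rates = event_rates.withColumn(
    'mean_encoded_scaled', F.col('mean_encoded') / total_mean_encoded)
event_rates.show()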