This seems like it should work, but it raises an error:
mu = mean(df[input])
sigma = stddev(df[input])
dft = df.withColumn(output, (df[input]-mu)/sigma)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and '`user`' is not an aggregate function. Wrap '(((CAST(`sum(response)` AS DOUBLE) - avg(`sum(response)`)) / stddev_samp(CAST(`sum(response)` AS DOUBLE))) AS `scaled`)' in windowing function(s) or wrap '`user`' in first() (or first_value) if you don't care which value you get.;;\nAggregate [user#0, sum(response)#26L, ((cast(sum(response)#26L as double) - avg(sum(response)#26L)) / stddev_samp(cast(sum(response)#26L as double))) AS scaled#46]\n+- AnalysisBarrier\n +- Aggregate [user#0], [user#0, sum(cast(response#3 as bigint)) AS sum(response)#26L]\n +- Filter item_id#1 IN (129,130,131,132,133,134,135,136,137,138)\n +- Relation[user#0,item_id#1,response_value#2,response#3,trait#4,response_timestamp#5] csv\n"
I'm not sure what this error message means.
Answer 0 (score: 1)
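In general, using collect() is not a good solution, and you will find that it does not scale as your data grows.
If you don't want to use StandardScaler, a better approach is to use a Window to compute the mean and standard deviation.
Borrowing the same example from "StandardScaler in Spark not working as expected":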
import numpy as np  # needed for np.array below
from pyspark.sql.functions import col, mean, stddev
from pyspark.sql import Window
df = spark.createDataFrame(
np.array(range(1,10,1)).reshape(3,3).tolist(),
["int1", "int2", "int3"]
)
df.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#| 1| 2| 3|
#| 4| 5| 6|
#| 7| 8| 9|
#+----+----+----+
Suppose you want to standardize the column int2:
input_col = "int2"
output_col = "int2_scaled"
w = Window.partitionBy()
mu = mean(input_col).over(w)
sigma = stddev(input_col).over(w)
df.withColumn(output_col, (col(input_col) - mu)/(sigma)).show()
#+----+----+----+-----------+
#|int1|int2|int3|int2_scaled|
#+----+----+----+-----------+
#| 1| 2| 3| -1.0|
#| 7| 8| 9| 1.0|
#| 4| 5| 6| 0.0|
#+----+----+----+-----------+
If you want to use the population standard deviation, as in the other example, replace pyspark.sql.functions.stddev with pyspark.sql.functions.stddev_pop().
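To see what that choice changes, here is a minimal sketch in plain Python (no Spark needed) using the int2 values from the example. It assumes that stddev behaves like stddev_samp, which is what the error message in the question shows it resolving to, and it uses the standard library's statistics.stdev and statistics.pstdev as stand-ins for the two Spark functions. The variable names are only illustrative.

```python
import statistics

values = [2, 5, 8]  # the int2 column from the example above

mu = statistics.mean(values)

# Sample standard deviation: divides by n - 1 (stand-in for stddev / stddev_samp).
sample_sigma = statistics.stdev(values)

# Population standard deviation: divides by n (stand-in for stddev_pop).
population_sigma = statistics.pstdev(values)

sample_scaled = [(v - mu) / sample_sigma for v in values]
population_scaled = [(v - mu) / population_sigma for v in values]

print(sample_scaled)      # [-1.0, 0.0, 1.0], matching int2_scaled above
print(population_scaled)  # larger in magnitude, since the divisor is smaller
```

With only three rows the two results differ noticeably; on large data the difference becomes negligible.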
Answer 1 (score: 0)
Fortunately, I was able to find code that works:
summary = df.select([mean(input).alias('mu'), stddev(input).alias('sigma')])\
.collect().pop()
dft = df.withColumn(output, (df[input]-summary.mu)/summary.sigma)