I am trying to port an sklearn routine that uses StandardScaler to Spark ML. I set with_std=True and with_mean=False so the data is scaled to unit variance without centering. However, sklearn and Spark ML do not give the same result: the sklearn output ends up with a standard deviation of essentially 1, while the Spark ML output has a standard deviation of about 0.82, not 1.
# sklearn version
import pandas as pd
from sklearn.preprocessing import StandardScaler

# baseline vectors as a pandas DataFrame
baseline_vectors_pdf = pd.DataFrame({"a": [3, 7, 13], "b": [5, 11, 17]})
#     a   b
# 0   3   5
# 1   7  11
# 2  13  17

# scale to unit variance (no centering, since with_mean=False)
translate = StandardScaler(with_mean=False, with_std=True).fit(baseline_vectors_pdf)
baseline_translated = translate.transform(baseline_vectors_pdf)
# output:
# array([[0.7299964 , 1.02062073],
#        [1.70332492, 2.2453656 ],
#        [3.16331771, 3.47011047]])
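As a sanity check, the sklearn output can be reproduced with plain numpy by dividing each column by its population standard deviation (ddof=0), which, as far as I can tell, is the estimator sklearn's StandardScaler uses:

# numpy sketch: dividing by the population standard deviation (ddof=0)
# reproduces the sklearn output above
import numpy as np

X = np.array([[3.0, 5.0], [7.0, 11.0], [13.0, 17.0]])
X / X.std(axis=0, ddof=0)
# array([[0.7299964 , 1.02062073],
#        [1.70332492, 2.2453656 ],
#        [3.16331771, 3.47011047]])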
# Spark ML version
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(3, 5), (7, 11), (13, 17)], ["a", "b"])
vecAssembler = VectorAssembler(inputCols=["a", "b"], outputCol="features")
df = vecAssembler.transform(df)

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withMean=False, withStd=True)
scalerModel = scaler.fit(df)
transformed_df = scalerModel.transform(df)
transformed_df.take(3)
# output:
# [Row(a=3, b=5, features=DenseVector([3.0, 5.0]), scaledFeatures=DenseVector([0.596, 0.8333])),
#  Row(a=7, b=11, features=DenseVector([7.0, 11.0]), scaledFeatures=DenseVector([1.3908, 1.8333])),
#  Row(a=13, b=17, features=DenseVector([13.0, 17.0]), scaledFeatures=DenseVector([2.5828, 2.8333]))]
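The Spark ML numbers, on the other hand, look like each column was divided by the corrected sample standard deviation (ddof=1) instead, which is what I suspect pyspark.ml's StandardScaler does internally:

# numpy sketch: dividing by the sample standard deviation (ddof=1)
# matches the scaledFeatures values above
import numpy as np

X = np.array([[3.0, 5.0], [7.0, 11.0], [13.0, 17.0]])
X / X.std(axis=0, ddof=1)
# roughly [[0.596, 0.8333], [1.3908, 1.8333], [2.5828, 2.8333]]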
I did some calculations by hand to see what the standard deviation is in each case, and for Spark ML it does not come out as one.
Spark ML (scaledFeatures):
First variable (column a):
mean = (0.596 + 1.3908 + 2.5828) / 3 = 1.5232
standard deviation = sqrt(((0.596 - 1.5232)^2 + (1.3908 - 1.5232)^2 + (2.5828 - 1.5232)^2) / 3) = 0.8164928577
Second variable (column b):
mean = (0.8333 + 1.8333 + 2.8333) / 3 = 1.8333
standard deviation = sqrt(((0.8333 - 1.8333)^2 + (1.8333 - 1.8333)^2 + (2.8333 - 1.8333)^2) / 3) = 0.8164965809
sklearn (baseline_translated):
First variable (column a):
mean = (0.7299964 + 1.70332492 + 3.16331771) / 3 = 1.8655463433
standard deviation = sqrt(((0.7299964 - 1.8655463433)^2 + (1.70332492 - 1.8655463433)^2 + (3.16331771 - 1.8655463433)^2) / 3) = 0.9999999974
Second variable (column b):
mean = (1.02062073 + 2.2453656 + 3.47011047) / 3 = 2.2453656
standard deviation = sqrt(((1.02062073 - 2.2453656)^2 + (2.2453656 - 2.2453656)^2 + (3.47011047 - 2.2453656)^2) / 3) = 0.9999999989
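The same check in numpy, to rule out arithmetic mistakes on my part:

import numpy as np

spark_scaled = np.array([[0.596, 0.8333], [1.3908, 1.8333], [2.5828, 2.8333]])
sklearn_scaled = np.array([[0.7299964, 1.02062073],
                           [1.70332492, 2.2453656],
                           [3.16331771, 3.47011047]])

spark_scaled.std(axis=0, ddof=0)    # roughly [0.8165, 0.8165]
sklearn_scaled.std(axis=0, ddof=0)  # roughly [1.0, 1.0]

Since 0.8165 is sqrt(2/3) = sqrt((N-1)/N) with N = 3 rows, this again points towards a population vs. corrected sample standard deviation difference between the two implementations. Is that what is going on, and how can I make the Spark ML result match the sklearn one?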