Spark ML StandardScaler vs. sklearn StandardScaler (with_std=True and with_mean=False)

Time: 2018-09-17 20:40:21

Tags: apache-spark scikit-learn apache-spark-mllib apache-spark-ml apache-spark-2.2

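I am trying to rewrite a sklearn function that uses StandardScaler in Spark ML. I set with_std=True and with_mean=False to scale the features to unit variance without centering them. However, the results from sklearn and Spark ML differ: sklearn's output has unit variance, while Spark ML's comes out at about 0.8 rather than 1.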

sklearn

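from sklearn.preprocessing import StandardScaler
import pandas as pd

# baseline vectors as a pandas DataFrame
#     a   b
# 0   3   5
# 1   7   11
# 2   13  17
baseline_vectors_pdf = pd.DataFrame({"a": [3, 7, 13], "b": [5, 11, 17]})

# scale each column to unit variance; with_mean=False means no centering
translate = StandardScaler(with_mean=False, with_std=True).fit(baseline_vectors_pdf)
baseline_translated = translate.transform(baseline_vectors_pdf)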
# output:

array([[0.7299964 , 1.02062073],
       [1.70332492, 2.2453656 ],
       [3.16331771, 3.47011047]])

Spark ML

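from pyspark.ml.feature import StandardScaler, VectorAssembler

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([(3, 5), (7, 11), (13, 17)], ["a", "b"])
# pack the two columns into a single vector column
vecAssembler = VectorAssembler(inputCols=["a", "b"], outputCol="features")
df = vecAssembler.transform(df)
# scale to unit standard deviation without centering
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withMean=False, withStd=True)
scalerModel = scaler.fit(df)
transformed_df = scalerModel.transform(df)
transformed_df.take(3)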
# output:

[Row(a=3, b=5, features=DenseVector([3.0, 5.0]), scaledFeatures=DenseVector([0.596, 0.8333])),
 Row(a=7, b=11, features=DenseVector([7.0, 11.0]), scaledFeatures=DenseVector([1.3908, 1.8333])),
 Row(a=13, b=17, features=DenseVector([13.0, 17.0]), scaledFeatures=DenseVector([2.5828, 2.8333]))]
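
One way to see where the two outputs diverge is to rescale the columns by hand with each of the two common standard deviation definitions. The sketch below uses only Python's standard library: `pstdev` divides by n (population), `stdev` divides by n - 1 (sample). The `scale` helper and the variable names are illustrative and not part of either library. Reading the first result as "what sklearn does" and the second as "what Spark ML does" is an assumption based on each library's documented behaviour, not something verified against the libraries here.

```python
from statistics import pstdev, stdev

# Same data as in the question.
columns = {"a": [3, 7, 13], "b": [5, 11, 17]}

def scale(values, std_fn):
    """Divide each value by the column's standard deviation (no centering)."""
    s = std_fn(values)
    return [round(v / s, 4) for v in values]

# Divisor n: assumed to correspond to sklearn's StandardScaler.
population_scaled = {name: scale(vals, pstdev) for name, vals in columns.items()}
# Divisor n - 1: assumed to correspond to Spark ML's StandardScaler.
sample_scaled = {name: scale(vals, stdev) for name, vals in columns.items()}

print(population_scaled)
print(sample_scaled)
```

After rounding, the first dictionary lines up with the sklearn array and the second with the Spark `scaledFeatures` vectors shown above, which suggests the only difference is the divisor used for the standard deviation.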

I worked through the numbers by hand to check whether each scaled column has unit variance in both cases. For Spark ML it does not.

Spark ML (withMean=False, withStd=True)

First variable:

Mean = (0.596 + 1.3908 + 2.5828) / 3 = 1.5232

Standard deviation = sqrt(((0.596 - 1.5232)^2 + (1.3908 - 1.5232)^2 + (2.5828 - 1.5232)^2) / 3) = 0.8164928577

Second variable:

Mean = (0.8333 + 1.8333 + 2.8333) / 3 = 1.8333

Standard deviation = sqrt(((0.8333 - 1.8333)^2 + (1.8333 - 1.8333)^2 + (2.8333 - 1.8333)^2) / 3) = 0.8164965809

sklearn (with_mean=False, with_std=True)

First variable:

Mean = (0.7299964 + 1.70332492 + 3.16331771) / 3 = 1.8655463433

Standard deviation = sqrt(((0.7299964 - 1.8655463433)^2 + (1.70332492 - 1.8655463433)^2 + (3.16331771 - 1.8655463433)^2) / 3) = 0.9999999974

Second variable:

Mean = (1.02062073 + 2.2453656 + 3.47011047) / 3 = 2.2453656

Standard deviation = sqrt(((1.02062073 - 2.2453656)^2 + (2.2453656 - 2.2453656)^2 + (3.47011047 - 2.2453656)^2) / 3) = 0.9999999989
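
The hand calculations above can be re-run with the standard library as a sanity check. In this sketch the scaled values are copied from the two outputs in the question, and `pstdev` divides by n, matching the formula used by hand. The comparison against sqrt((n - 1) / n) rests on the assumption that Spark ML divides by the sample standard deviation (n - 1) while sklearn divides by the population one (n); the variable names are illustrative.

```python
from math import sqrt
from statistics import pstdev

# Scaled columns copied from the two outputs in the question.
spark_a = [0.596, 1.3908, 2.5828]
spark_b = [0.8333, 1.8333, 2.8333]
sklearn_a = [0.7299964, 1.70332492, 3.16331771]
sklearn_b = [1.02062073, 2.2453656, 3.47011047]

n = 3  # number of rows

# Population standard deviation (divisor n) of each scaled column.
spark_spread = [pstdev(spark_a), pstdev(spark_b)]
sklearn_spread = [pstdev(sklearn_a), pstdev(sklearn_b)]

# If a column was scaled by its sample standard deviation (divisor n - 1),
# its population standard deviation would be sqrt((n - 1) / n).
expected_ratio = sqrt((n - 1) / n)

print(spark_spread)
print(sklearn_spread)
print(expected_ratio)
```

With n = 3 the expected ratio is about 0.8165, which is what the Spark columns show. Under the same assumption, multiplying Spark's scaled values by sqrt(n / (n - 1)) would bring them in line with sklearn's; for example 0.596 * sqrt(3 / 2) is roughly 0.73.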

0 answers:

No answers yet