Is there no "inverse_transform" method for scalers like MinMaxScaler in Spark?

Asked: 2017-09-07 08:59:38

Tags: apache-spark machine-learning normalization apache-spark-mllib inverse-transform

When training a model such as linear regression, we may normalize the training and test datasets with something like MinMaxScaler.

After we obtain the trained model and use it to make predictions, we need to scale the predictions back to the original representation.

In Python, there is an "inverse_transform" method for this. For example:

from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
print(data)

# Fit on the data and scale it into the [0, 1] feature range
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
dataScaled = scaler.fit(data).transform(data)
print(dataScaled)
# [[0.   0.  ]
#  [0.25 0.25]
#  [0.5  0.5 ]
#  [1.   1.  ]]

# inverse_transform maps the scaled values back to the original representation
print(scaler.inverse_transform(dataScaled))
# [[-1.   2. ]
#  [-0.5  6. ]
#  [ 0.  10. ]
#  [ 1.  18. ]]

Is there a similar method in Spark?

I have searched a lot but found no answer. Can anyone give me some advice? Thank you very much!

3 Answers:

Answer 0 (score: 1)

At our company, to solve the same problem for StandardScaler, we extended spark.ml with this (among other things):

package org.apache.spark.ml

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.util.Identifiable

package object feature {

    implicit class RichStandardScalerModel(model: StandardScalerModel) {

        // Standardization is x' = (x - mu) / sigma; its inverse x = x' * sigma + mu
        // is itself a standardization with sigma' = 1 / sigma and mu' = -mu / sigma.
        private def invertedStdDev(sigma: Double): Double = 1 / sigma

        private def invertedMean(mu: Double, sigma: Double): Double = -mu / sigma

        def inverse(newOutputCol: String): StandardScalerModel = {
            val sigma: linalg.Vector = model.std
            val mu: linalg.Vector = model.mean
            val newSigma: linalg.Vector = new DenseVector(sigma.toArray.map(invertedStdDev))
            val newMu: linalg.Vector = new DenseVector(mu.toArray.zip(sigma.toArray).map { case (m, s) => invertedMean(m, s) })
            // Build a new model whose input column is the original model's output column
            val inverted: StandardScalerModel = new StandardScalerModel(Identifiable.randomUID("stdScal"), newSigma, newMu)
                .setInputCol(model.getOutputCol)
                .setOutputCol(newOutputCol)

            inverted
                .set(inverted.withMean, model.getWithMean)
                .set(inverted.withStd, model.getWithStd)
        }
    }

}

It should be fairly easy to modify this, or to do something similar for your specific case.

Keep in mind that, due to the JVM's implementation of Double, you will usually lose some precision in these operations, so you will not recover the exact original values from before the transformation (e.g. you will probably get something like 1.9999999999999998 instead of 2.0).
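To see why the inverted parameters work, here is a minimal NumPy sketch of the identity (illustrative only, independent of Spark): applying a standardization with sigma' = 1/sigma and mu' = -mu/sigma to already-standardized data recovers the original values.

import numpy as np

x = np.array([2.0, 6.0, 10.0, 18.0])
mu, sigma = x.mean(), x.std()

# Forward standardization: x' = (x - mu) / sigma
x_scaled = (x - mu) / sigma

# The inverse, expressed as another standardization:
# (x' - mu') / sigma' = (x' + mu/sigma) * sigma = x' * sigma + mu = x
inv_sigma = 1.0 / sigma
inv_mu = -mu / sigma
x_restored = (x_scaled - inv_mu) / inv_sigma

print(np.allclose(x_restored, x))  # True (up to floating-point error)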

Answer 1 (score: 0)

There is no direct solution here. Since only strings can be passed to a UDF, not arrays (lit(array) does not solve the problem), I use the following workaround.

In short, it converts the scale arrays into strings, passes them to the UDF, and solves the math there.

You can use the scale arrays (as strings) in the inverse function (also attached here) to get the inverted values.

Code:

import numpy as np

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import col, lit, udf

df = spark.createDataFrame([
    (0, 1, 0.5, -1),
    (1, 2, 1.0, 1),
    (2, 4, 10.0, 2)
], ["id", 'x1', 'x2', 'x3'])

df.show()

def Normalize(df):

    # Collect per-column mean and stddev from describe()
    scales = df.describe()
    scales = scales.filter("summary = 'mean' or summary = 'stddev'")
    scales = scales.select(["summary"] + [col(c).cast("double") for c in scales.columns[1:]])

    assembler = VectorAssembler(
         inputCols=scales.columns[1:],
         outputCol="X_scales")

    df_scales = assembler.transform(scales)

    x_mean = df_scales.filter("summary = 'mean'").select('X_scales')
    x_std = df_scales.filter("summary = 'stddev'").select('X_scales')

    # Serialize the scale vectors as '|'-delimited strings so they can be
    # passed to the UDFs as literals
    ks_std_lit = lit('|'.join([str(s) for s in list(x_std.collect()[0].X_scales)]))
    ks_mean_lit = lit('|'.join([str(s) for s in list(x_mean.collect()[0].X_scales)]))

    assembler = VectorAssembler(
        inputCols=df.columns[0:4],
        outputCol="features")

    df_features = assembler.transform(df)
    df_features = df_features.withColumn('Scaled', exec_norm_udf(df_features.features, ks_mean_lit, ks_std_lit))

    return df_features, ks_mean_lit, ks_std_lit

def exec_norm(vector, x_mean, x_std):
    # Parse the serialized scales back into float arrays
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]

    # Standardize: (x - mean) / std
    res = (np.array(vector) - np.array(x_mean)) / np.array(x_std)
    res = list(res)

    return Vectors.dense(res)


exec_norm_udf = udf(exec_norm, VectorUDT())


def scaler_invert(vector, x_mean, x_std):
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]

    # Invert the standardization: x * std + mean
    res = (np.array(vector) * np.array(x_std)) + np.array(x_mean)
    res = list(res)

    return Vectors.dense(res)


scaler_invert_udf = udf(scaler_invert, VectorUDT())


df, scaler_mean, scaler_std = Normalize(df)
df.withColumn('inverted', scaler_invert_udf(df.Scaled, scaler_mean, scaler_std)).show(truncate=False)
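If everything works, the 'inverted' column should reproduce the original assembled feature vector (up to floating-point error), which confirms the round trip.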

Answer 2 (score: 0)

Maybe I'm late to the party, but I recently ran into the same problem and could not find a workable solution.

Assume the author of this question does not need to invert the MinMax values of an entire vector, but only needs to invert a single column. The Min and Max values of the column are known, as are the min and max parameters of the scaler.

The math behind MinMaxScaler, according to the scikit-learn website:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

"Reverse-engineering" the MinMaxScaler formula:

X_scaled = (X - Xmin) / (Xmax - Xmin) * (max - min) + min
X = (X_scaled - min) * (Xmax - Xmin) / (max - min) + Xmin
X = (max * Xmin - min * Xmax - Xmin * X_scaled + Xmax * X_scaled) / (max - min)
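As a quick sanity check with the data below (y column: Xmin = 2, Xmax = 18, feature_range min = 0, max = 1), for X_scaled = 0.25 the formula gives X = (1*2 - 0*18 - 2*0.25 + 18*0.25) / (1 - 0) = 2 - 0.5 + 4.5 = 6, which matches the original y value.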

Implementation

from sklearn.preprocessing import MinMaxScaler
import pandas

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)

# Put the original and scaled columns side by side in a Spark DataFrame
data_sp = spark.createDataFrame(pandas.DataFrame(data, columns=["x", "y"]).join(pandas.DataFrame(dataScaled, columns=["x_scaled", "y_scaled"])))
data_sp.show()
print("Inversing column: y_scaled")
# Known min/max of the original column, and the scaler's feature_range
Xmax = data_sp.select("y").rdd.max()[0]
Xmin = data_sp.select("y").rdd.min()[0]
_max = scaler.feature_range[1]
_min = scaler.feature_range[0]

print("Xmax =", Xmax, "Xmin =", Xmin, "max =", _max, "min =", _min)
# Apply the inverted formula derived above as a plain column expression
data_sp.withColumn(colName="y_scaled_inversed", col=(_max * Xmin - _min * Xmax - Xmin * data_sp.y_scaled + Xmax * data_sp.y_scaled)/(_max - _min)).show()

Output

[[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
+----+---+--------+--------+
|   x|  y|x_scaled|y_scaled|
+----+---+--------+--------+
|-1.0|  2|     0.0|     0.0|
|-0.5|  6|    0.25|    0.25|
| 0.0| 10|     0.5|     0.5|
| 1.0| 18|     1.0|     1.0|
+----+---+--------+--------+

Inversing column: y_scaled
Xmax = 18 Xmin = 2 max = 1 min = 0
+----+---+--------+--------+-----------------+
|   x|  y|x_scaled|y_scaled|y_scaled_inversed|
+----+---+--------+--------+-----------------+
|-1.0|  2|     0.0|     0.0|              2.0|
|-0.5|  6|    0.25|    0.25|              6.0|
| 0.0| 10|     0.5|     0.5|             10.0|
| 1.0| 18|     1.0|     1.0|             18.0|
+----+---+--------+--------+-----------------+