Question

I have a doubt that I have been trying to find an answer to. The question is what happens when I use:

import pandas as pd
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()

df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})

df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)

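After this, I train and test the model (A and B as features, C as the label) and get some accuracy score. My doubt is what happens when I have to predict the label for a new set of data. Say,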

data = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})

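Because when I normalize the columns, the values of A and B are scaled according to the new data, not the data the model was trained on. So after the data preparation step below, my data will be: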

data[['A','B']] = min_max_scaler.fit_transform(data[['A','B']])

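The values of A and B will now change with respect to the max and min values of data[['A','B']], whereas the data preparation of df[['A','B']] was done with respect to the min and max of df[['A','B']].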

How can the data preparation be valid when the numbers are different? I don't understand how the prediction can be correct here.

Answer 1

You should fit the `MinMaxScaler` using the `training` data and then apply the scaler to the `testing` data before the prediction.

In summary:

Step 1: fit the scaler on the TRAINING data
Step 2: use the scaler to transform the TRAINING data
Step 3: use the transformed training data to fit the predictive model
Step 4: use the scaler to transform the TEST data
Step 5: predict using the trained model and the transformed TEST data
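
The five steps above can also be bundled so that the scaler cannot be refit on new data by accident. The sketch below uses scikit-learn's `make_pipeline` for this; it is an alternative arrangement, not part of the original answer. `SVC` is an arbitrary choice of classifier, and the names `train`, `new` and `pipe` are only illustrative.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Training data from the question
train = pd.DataFrame({
    'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
    'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19],
    'C': ['Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y', 'Y'],
})
# Same label encoding as in the question: 'N' -> 0, anything else -> 1
y_train = (train['C'].str.strip() != 'N').astype(int)

# New, unlabeled data from the question
new = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

# Steps 1-3: fit() fits the scaler on the training data, transforms it,
# and fits the classifier on the transformed values
pipe = make_pipeline(MinMaxScaler(), SVC())
pipe.fit(train[['A', 'B']], y_train)

# Steps 4-5: predict() applies the already-fitted scaler (no refit)
# and then the trained classifier
y_new = pipe.predict(new[['A', 'B']])
print(y_new)
```

With this arrangement only `fit` ever touches the scaler's statistics, so calling `predict` on new data always reuses the training min and max.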

Example using your data:

import pandas as pd
from sklearn import preprocessing
from sklearn.svm import SVC

min_max_scaler = preprocessing.MinMaxScaler()
#training data
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
#fit and transform the training data and use them for the model training
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)

#fit the model (SVC is an assumption here; any scikit-learn classifier works the same way)
model = SVC()
model.fit(df[['A','B']], df['C'])

#after the model training on the transformed training data define the testing data df_test
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})

#before the prediction of the test data, ONLY APPLY the scaler on them
df_test[['A','B']] = min_max_scaler.transform(df_test[['A','B']])

#predict on the scaled test data
y_predicted_from_model = model.predict(df_test[['A','B']])
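
To see why `transform` is the right call for new data, it may help to look at what the fitted scaler stores. The sketch below assumes the default `feature_range=(0, 1)` and checks that `transform` computes `(x - train_min) / (train_max - train_min)` using the training statistics, then contrasts this with refitting on the new rows as the question does. The names `train`, `new`, `scaled_new` and `refit_new` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Feature columns of the training data and the new data from the question
train = pd.DataFrame({
    'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
    'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19],
})
new = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

scaler = MinMaxScaler().fit(train)

# Statistics learned from the training data only
print(scaler.data_min_)  # per-column training minimum
print(scaler.data_max_)  # per-column training maximum

# transform() reuses those statistics on the new rows
scaled_new = scaler.transform(new)
manual = (new.to_numpy() - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)
print(np.allclose(scaled_new, manual))

# Refitting on the new data instead takes the min and max from the new rows,
# so the same raw value ends up at a different scaled value
refit_new = MinMaxScaler().fit_transform(new)

print(scaled_new[0])  # first new row on the training scale
print(refit_new[0])   # the same row after refitting on the new data
```

The first new row (A=25, B=2) maps to (1.6, -0.125) on the training scale, because the training range is 1 to 16 for A and 4 to 20 for B. Values outside [0, 1] are expected here, since the new data lies outside the range seen in training.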

Example using the iris data:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = datasets.load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = SVC()
model.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)

Hope this helps.

Answer 2

The best approach is to fit and save the MinMaxScaler, then load the same fitted scaler whenever it is needed.

Saving the scaler:

import pickle

# min_max_scaler is the MinMaxScaler instance created earlier
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
with open("scaler.pkl", 'wb') as f:
    pickle.dump(min_max_scaler, f)

Loading the saved scaler:

with open("scaler.pkl", 'rb') as f:
    scalerObj = pickle.load(f)
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
df_test[['A','B']] = scalerObj.transform(df_test[['A','B']])
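
As a quick sanity check on the save and load approach, the sketch below round-trips a fitted scaler through a pickle file and confirms that the loaded copy scales new data exactly like the original. It writes to a temporary directory so that it is self-contained; the path and the names `train`, `new`, `loaded` and `same` are only illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Feature columns of the training data and the new data from the question
train = pd.DataFrame({
    'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
    'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19],
})
new = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

# Fit once on the training data
scaler = MinMaxScaler().fit(train)

# Save and reload through a temporary file (location chosen only for this sketch)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pkl")
    with open(path, "wb") as f:
        pickle.dump(scaler, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The loaded scaler carries the training min and max,
# so it transforms new data identically to the original
same = np.allclose(loaded.transform(new), scaler.transform(new))
print(same)
```

Pickled scikit-learn objects are best loaded with the same scikit-learn version that saved them, and pickle files should only be loaded from sources you trust.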
