Question

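I have a scaled dataset that I have already fit a regression model on.

When a single new sample comes in for prediction, how should I scale that input before predicting?

I could concat it to the original DataFrame, rescale, and extract the bottom row. But wouldn't that cause data leakage? And would I then have to refit the model as well?

What is the correct way to handle this situation?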

Answer 1

You should scale the test data with the scaler you fitted earlier on the training data.
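For the single-sample case in the question, a minimal sketch could look like the following. The dataset (`load_diabetes`), the `StandardScaler`, and the `LinearRegression` model are assumptions chosen only so the example runs offline; the point is that the new sample is reshaped to 2-D and passed through `transform()` of the already fitted scaler.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed stand-in data; any numeric regression dataset would do
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler and the model on the training data only
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# A single incoming sample is 1-D; transform() expects a 2-D array
new_sample = X_test[0]                       # shape (n_features,)
new_sample_2d = new_sample.reshape(1, -1)    # shape (1, n_features)

# Only transform(), never fit() or fit_transform(), on new data
new_sample_scaled = scaler.transform(new_sample_2d)
prediction = model.predict(new_sample_scaled)
print(prediction)
```

The scaler's statistics stay exactly as learned from the training set, so the model does not need to be refit.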

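Inserting the row into the original DataFrame and rescaling is not the correct approach: it leads to data leakage, and it does not reflect production, where you do not get to see the incoming data ahead of time.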
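A small sketch of why the concat-and-rescale approach is a problem, using made-up toy numbers and `StandardScaler` as an assumed example scaler: once the new row is included in the fit, the scaler's statistics change, so both the new sample and the training rows are scaled differently from what the model was trained on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, for illustration only)
X_train = np.array([[1.0], [2.0], [3.0]])
new_sample = np.array([[10.0]])

# Correct: scaler fitted on the training data only
train_only = StandardScaler().fit(X_train)
scaled_correct = train_only.transform(new_sample)

# Concat approach: the new row takes part in fitting the scaler
X_concat = np.vstack([X_train, new_sample])
refit = StandardScaler().fit(X_concat)
scaled_leaky = refit.transform(new_sample)

print(scaled_correct, scaled_leaky)

# The training rows themselves are now scaled differently as well,
# so a model fitted on the old scaled values no longer matches its inputs
print(train_only.transform(X_train).ravel())
print(refit.transform(X_train).ravel())
```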

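Say you have several such samples and you decide to fit the scaler again so that it sees this new data. That is considered bad practice and leads to data leakage. Only the original scaler, fitted on the training data, should be used.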
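One way to make sure the same fitted scaler is reused later is to serialize it next to the model. The sketch below is an assumption about workflow, not something the answer prescribes; it uses the standard-library `pickle` on an in-memory byte string, with random data standing in for real features.

```python
import pickle

import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))   # stand-in training features

# Fit once, on the training data
scaler = RobustScaler().fit(X_train)

# Serialize the fitted scaler (in practice this would be written to disk)
blob = pickle.dumps(scaler)

# Later, at prediction time: restore it and only call transform()
restored = pickle.loads(blob)
new_batch = rng.normal(size=(5, 3))   # stand-in for newly arrived samples
scaled = restored.transform(new_batch)
print(scaled)
```

The restored object carries the statistics learned from the training set, so new samples are scaled exactly as the training data was.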

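What is interesting to me is what happens when your train and test data have different distributions: however carefully you choose the scaling strategy, it will not work well on the test data.
This link is useful: it describes the problem and possible solutions.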
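As a rough illustration of the distribution-mismatch point, the sketch below fits a `MinMaxScaler` on synthetic training data and applies it to deliberately shifted test data. The data, the size of the shift, and the "fraction outside [0, 1]" check are all assumptions made for the example; it is one simple symptom to look at, not a full drift test.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_test = rng.normal(loc=3.0, scale=1.0, size=(500, 2))   # shifted on purpose

# Fit on train only, then transform the test data
scaler = MinMaxScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# Training data maps into [0, 1]; shifted test data largely does not
outside = (X_test_scaled < 0.0) | (X_test_scaled > 1.0)
fraction_outside = outside.mean(axis=0)
print(fraction_outside)
```

A large fraction of test values outside the training range suggests the model is being asked to extrapolate, whichever scaler is used.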

Here is an example of scaling train and test data, excerpted from here:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Load the dataset
dataset = fetch_california_housing()
X_full, y_full = dataset.data, dataset.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full)

# Initialize the scaler
scale = RobustScaler()

# fit_transform() fits the scaler and then transforms the data in one call.
# The scaler looks only at the training set and learns the statistics
# (median and interquartile range for RobustScaler) used to transform data.
X_train_scaled = scale.fit_transform(X_train)
print(X_train)
print(X_train_scaled)

# The scaler has now been fitted once. Use it as-is on all test/prediction
# data that comes in, which is why only transform() is called below.
# Fitting it again on that data would mean data leakage.
X_test_scale = scale.transform(X_test)
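A related option is to bundle the scaler and the model in a `Pipeline`, so that `predict()` applies the fitted scaler automatically. This is a sketch under assumptions: `load_diabetes` replaces the California housing data so the example runs without a download, and `Ridge` is an arbitrary choice of regressor.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Assumed stand-in dataset that ships with scikit-learn
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() fits the scaler on X_train, then fits the model on the scaled X_train
pipe = make_pipeline(RobustScaler(), Ridge())
pipe.fit(X_train, y_train)

# Slicing with [:1] keeps the 2-D shape (1, n_features)
single = X_test[:1]

# predict() calls the scaler's transform() and then the model's predict()
pred = pipe.predict(single)
print(pred)
```

With this design there is no separate scaler object to forget or to refit by accident at prediction time.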

Answer 2

This example uses `MinMaxScaler` to scale the data, but the same principle applies in all cases.

Summary of the process:

Step 1: fit the scaler on the TRAINING data
Step 2: use the scaler to transform the training data
Step 3: use the transformed training data to fit the predictive model
Step 4: use the scaler to transform the TEST data
Step 5: predict using the trained model and the transformed TEST data

Example using the iris data:
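A minimal sketch of the five steps on the iris data could look like this. It is not the answerer's original code: the `SVC` classifier, the 70/30 split, and the `random_state` are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: fit the scaler on the TRAINING data
scaler = MinMaxScaler()
scaler.fit(X_train)

# Step 2: use the scaler to transform the training data
X_train_scaled = scaler.transform(X_train)

# Step 3: use the transformed training data to fit the predictive model
model = SVC()
model.fit(X_train_scaled, y_train)

# Step 4: use the scaler to transform the TEST data
X_test_scaled = scaler.transform(X_test)

# Step 5: predict using the trained model and the transformed TEST data
y_pred = model.predict(X_test_scaled)
accuracy = (y_pred == y_test).mean()
print(accuracy)
```

A single new sample follows steps 4 and 5 in the same way, after reshaping it to shape `(1, n_features)`.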

Hope this helps. Cheers

How do I scale a single sample for prediction in sklearn?
