Scikit-learn StandardScaler

Time: 2018-08-07 04:54:36

Tags: python-3.x pandas scikit-learn

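I hope this question has not been asked before. I have a dataset with 18 columns: 14 numeric and 4 categorical. I am going to fit a linear regression, but first I want to scale the numeric data. To do that, I dropped the categorical columns, scaled the numeric ones, and then added the categorical columns back to the scaled frame. The problem is that after recombining the two subsets, the categorical columns are populated for only part of the training set (12143 of 16209 rows are non-null, as the final output shows).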

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=5)
X_train_sub = X_train[['waterfront','view', 'basement', 'renovated']]
col_names = list(X_train_sub)
for col in col_names:
  X_train_sub[col] = X_train_sub[col].astype('category',copy=False)

X_train_sub.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16209 entries, 10306 to 2915
Data columns (total 4 columns):
waterfront    16209 non-null category
view          16209 non-null category
basement      16209 non-null category
renovated     16209 non-null category
dtypes: category(4)

Scaling the training data after dropping the categorical variables

sc = StandardScaler()
X_scaled = X_train.drop(['waterfront','view', 'basement', 'renovated'], axis=1)
X_scaled = pd.DataFrame(sc.fit_transform(X_scaled),
                        columns=X_scaled.columns.values)

Adding the columns back

X_scaled[['waterfront','view', 'basement', 'renovated']] = X_train_sub
X_scaled.info()

Data columns (total 18 columns):
bedrooms         16209 non-null float64
bathrooms        16209 non-null float64
sqft_living      16209 non-null float64
sqft_lot         16209 non-null float64
floors           16209 non-null float64
condition        16209 non-null float64
grade            16209 non-null float64
sqft_above       16209 non-null float64
yr_built         16209 non-null float64
zipcode          16209 non-null float64
lat              16209 non-null float64
long             16209 non-null float64
sqft_living15    16209 non-null float64
sqft_lot15       16209 non-null float64
waterfront       12143 non-null category
view             12143 non-null category
basement         12143 non-null category
renovated        12143 non-null category
dtypes: category(4), float64(14)      

1 Answer:

Answer 0 (score: 1)

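I think this is an alignment issue. The following code discards the original index, because fit_transform returns a NumPy array and the new DataFrame gets a fresh 0..n-1 index: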

X_scaled = pd.DataFrame(sc.fit_transform(X_scaled), 
                        columns=X_scaled.columns.values)
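
To illustrate why a reset index leads to missing values, here is a minimal sketch on toy data. The frame and the column names `num` and `cat` are made up for illustration and are not taken from the question. When a Series or DataFrame is assigned as a column, pandas aligns on index labels, so any label missing from the right-hand side becomes NaN.

```python
import pandas as pd

# Toy stand-in for X_train after a shuffled split: non-contiguous index labels.
original = pd.DataFrame(
    {"num": [1.0, 2.0, 3.0], "cat": ["a", "b", "c"]},
    index=[10, 0, 2],
)

# Rebuilding a frame from a bare NumPy array gives a fresh RangeIndex (0, 1, 2).
rebuilt = pd.DataFrame(original[["num"]].to_numpy(), columns=["num"])

# Column assignment aligns on index labels, not on row position.
rebuilt["cat"] = original["cat"]

# Label 1 does not exist in `original`, so that row ends up as NaN;
# label 0 picks up "b", the value stored under label 0, not the first row.
n_missing = int(rebuilt["cat"].isna().sum())
print(rebuilt)
print("missing:", n_missing)
```

The same mechanism would explain the question's output: only the rows whose new 0..n-1 label also happens to exist in the training index receive a categorical value.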

Try:

X_scaled = pd.DataFrame(sc.fit_transform(X_scaled), 
                        columns=X_scaled.columns.values,
                        index=X_scaled.index)
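
As a rough end-to-end sketch of this fix, the following runs on a small made-up frame. The values, the reduced column set, and the helper names `cat_cols`, `X_num`, `X_cat`, and `result` are assumptions for illustration. It uses `DataFrame.join` to add the categorical columns back, which also aligns on the index and is one of several ways to recombine the frames.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training data with shuffled, non-contiguous index labels (illustrative only).
X_train = pd.DataFrame(
    {
        "sqft_living": [1000.0, 1500.0, 2000.0, 2500.0],
        "bedrooms": [2.0, 3.0, 3.0, 4.0],
        "waterfront": [0, 1, 0, 0],
    },
    index=[7, 3, 9, 1],
)
cat_cols = ["waterfront"]

# Split into categorical and numeric parts; both keep the original index.
X_cat = X_train[cat_cols].astype("category")
X_num = X_train.drop(columns=cat_cols)

# Scale the numeric part and rebuild the frame with the ORIGINAL index.
sc = StandardScaler()
X_scaled = pd.DataFrame(
    sc.fit_transform(X_num),
    columns=X_num.columns,
    index=X_num.index,
)

# Both frames now share the same labels, so the join leaves no gaps.
result = X_scaled.join(X_cat)
print(result)
```

With the index preserved, every row of the scaled frame finds its categorical counterpart, so the non-null counts should match across all columns.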