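I hope this question hasn't been answered before. I have a dataset with 18 columns: 14 hold numeric data and 4 are categorical. I am going to apply linear regression, but first I want to scale the numeric data. To do that, I dropped the categorical columns, scaled the numeric ones, and then merged the categorical columns back into the scaled frame. The problem is that after merging the two sub-datasets, the categorical columns no longer match the training data: they come back with only 12143 non-null values out of 16209 rows.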
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=5)
X_train_sub = X_train[['waterfront','view', 'basement', 'renovated']]
col_names = list(X_train_sub)
for col in col_names:
    X_train_sub[col] = X_train_sub[col].astype('category',copy=False)
X_train_sub.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16209 entries, 10306 to 2915
Data columns (total 4 columns):
waterfront 16209 non-null category
view 16209 non-null category
basement 16209 non-null category
renovated 16209 non-null category
dtypes: category(4)
Scale the training data after dropping the categorical variables
sc = StandardScaler()
X_scaled = X_train.drop(['waterfront','view', 'basement', 'renovated'], axis=1)
X_scaled = pd.DataFrame(sc.fit_transform(X_scaled),
columns=X_scaled.columns.values)
Add the categorical columns back
X_scaled[['waterfront','view', 'basement', 'renovated']] = X_train_sub
X_scaled.info()
Data columns (total 18 columns):
bedrooms 16209 non-null float64
bathrooms 16209 non-null float64
sqft_living 16209 non-null float64
sqft_lot 16209 non-null float64
floors 16209 non-null float64
condition 16209 non-null float64
grade 16209 non-null float64
sqft_above 16209 non-null float64
yr_built 16209 non-null float64
zipcode 16209 non-null float64
lat 16209 non-null float64
long 16209 non-null float64
sqft_living15 16209 non-null float64
sqft_lot15 16209 non-null float64
waterfront 12143 non-null category
view 12143 non-null category
basement 12143 non-null category
renovated 12143 non-null category
dtypes: category(4), float64(14)
Answer 0 (score: 1)
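I think this is an index alignment problem. The following code discards the original index, because fit_transform returns a plain NumPy array and the new DataFrame gets a default 0..n-1 index: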
X_scaled = pd.DataFrame(sc.fit_transform(X_scaled),
columns=X_scaled.columns.values)
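To see why the lost index produces missing values, here is a minimal sketch on a made-up four-row frame (the data, the column names `sqft` and `waterfront`, and the index labels are all hypothetical, not taken from the question's dataset). It shows that column assignment in pandas aligns on index labels rather than row position, so any label that exists on only one side ends up as NaN.

```python
import pandas as pd

# Hypothetical toy frame with a shuffled index, similar to what
# train_test_split leaves behind on the training rows.
df = pd.DataFrame(
    {"sqft": [100.0, 200.0, 300.0, 400.0], "waterfront": [0, 1, 0, 1]},
    index=[7, 2, 9, 1],
)

# Rebuilding a DataFrame from a bare NumPy array gives a fresh
# default index (0, 1, 2, 3); the labels 7, 2, 9, 1 are gone.
rebuilt = pd.DataFrame(df[["sqft"]].to_numpy(), columns=["sqft"])

# Assignment aligns on index labels, not on row position.
# Only labels 1 and 2 exist in both frames, so labels 0 and 3 get NaN.
rebuilt["waterfront"] = df["waterfront"]

n_missing = int(rebuilt["waterfront"].isna().sum())
print(list(rebuilt.index), n_missing)
```

In the question, the same mechanism would leave non-null values only where a label in 0..16208 also happens to be in the training index, which is consistent with 12143 non-null rows out of 16209. Note that even the non-null values are matched by label, so they are not guaranteed to belong to the right row.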
Try passing the original index through instead:
X_scaled = pd.DataFrame(sc.fit_transform(X_scaled),
columns=X_scaled.columns.values,
index=X_scaled.index)
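As an end-to-end check of this fix, here is a small self-contained sketch. The toy values, the index labels, and the names `categorical`, `numeric`, and `scaler` are assumptions made for illustration; only the column names and the use of `StandardScaler` come from the question.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train, with a shuffled index.
X_train = pd.DataFrame(
    {
        "sqft_living": [1000.0, 1500.0, 2000.0, 2500.0],
        "bedrooms": [2.0, 3.0, 3.0, 4.0],
        "waterfront": [0, 1, 0, 1],
    },
    index=[7, 2, 9, 1],
)

categorical = ["waterfront"]
numeric = X_train.drop(columns=categorical)

# Scale the numeric columns and keep the original index on the result.
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)

# Both frames now share the same index labels, so alignment is exact
# and no missing values are introduced.
for col in categorical:
    X_scaled[col] = X_train[col].astype("category")

X_scaled.info()
```

With the index preserved, every categorical column should report the same non-null count as the numeric columns.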