I have been trying to train a RandomForestRegressor to predict house prices for a given test set.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MaxAbsScaler
file='file:///F:/Download sort required/train.csv'
data=pd.read_csv(file)
data.dropna(axis=0,subset=['SalePrice'],inplace=True)
y=data.SalePrice
predictors=['LotArea','OverallQual','GrLivArea','GarageCars','TotRmsAbvGrd','Neighborhood','HouseStyle','YearBuilt','ExterQual','KitchenQual']
One_hot_encoded_predictors=['Neighborhood','HouseStyle','YearBuilt','ExterQual','KitchenQual']
X_uncoded=data[predictors]
#Encoding the training data
X_uncoded=pd.get_dummies(X_uncoded,columns=One_hot_encoded_predictors)
X=X_uncoded
maxabsscaler=MaxAbsScaler()
X_max_abs=maxabsscaler.fit_transform(X)
model=RandomForestRegressor()
model.fit(X_max_abs,y)
test_file='file:///C:/Users/shand/Downloads/test.csv'
test_data=pd.read_csv(test_file)
X_uncoded_test=test_data[predictors]
X_uncoded_test=pd.get_dummies(X_uncoded_test,columns=One_hot_encoded_predictors)
X_test=X_uncoded_test
X_test.fillna(X_test.mean(),inplace=True)
X_max_abs_test=maxabsscaler.fit_transform(X_test)
predicted_prices=model.predict(X_max_abs_test)
my_submission = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predicted_prices})
my_submission.to_csv('submission.csv', index=False)
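I applied one-hot encoding to the categorical features, followed by a MaxAbsScaler transform, because most of the data ranges from -1 to 1 or from 0 to 1. But when I run the code, it raises the following error: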
> 28 X_test.fillna(X_test.mean(),inplace=True)
> 29 X_max_abs_test=maxabsscaler.fit_transform(X_test)
> ---> 30 predicted_prices=model.predict(X_max_abs_test)
> 31 my_submission = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predicted_prices})
> 32 my_submission.to_csv('submission.csv', index=False)
>
> C:\Users\shand\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py
> in predict(self, X)
> 683 """
> 684 # Check data
> --> 685 X = self._validate_X_predict(X)
> 686
> 687 # Assign chunk of trees to jobs
>
> C:\Users\shand\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py
> in _validate_X_predict(self, X)
> 353 "call `fit` before exploiting the model.")
> 354
> --> 355 return self.estimators_[0]._validate_X_predict(X, check_input=True)
> 356
> 357 @property
>
> C:\Users\shand\Anaconda3\lib\site-packages\sklearn\tree\tree.py in
> _validate_X_predict(self, X, check_input)
> 374 "match the input. Model n_features is %s and "
> 375 "input n_features is %s "
> --> 376 % (self.n_features_, n_features))
> 377
> 378 return X
>
> ValueError: Number of features of the model must match the input.
> Model n_features is 158 and input n_features is 151
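After applying one-hot encoding and MaxAbsScaler, the model is trained on 158 features. Can anyone explain why I get this error even though I applied the same transformations to both the training and test data? What should I do to fix it?
P.S. The data comes from https://www.kaggle.com/c/house-prices-advanced-regression-techniques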
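Answer 0 (score: 0)
As you mentioned, the train and test data have different numbers of columns after encoding. The train data has 158 columns, whereas the test data has only 151.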
#Encoding the train data
X_uncoded=pd.get_dummies(X_uncoded,columns=One_hot_encoded_predictors)
X=X_uncoded
print(X.shape)
(1460, 158)
#Encoding the test data
X_uncoded_test=pd.get_dummies(X_uncoded_test,columns=One_hot_encoded_predictors)
print(X_uncoded_test.shape)
(1459, 151)
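This is probably because the test data has fewer levels (distinct category values) than the train data, and get_dummies only creates a column for each level it actually sees. See the following example from the pandas.get_dummies documentation:
import pandas as pd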
s = pd.Series(list('abca'))
pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
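To see the same effect on a train/test pair, here is a minimal sketch. The two frames and their values are made up for illustration (they are not taken from the Kaggle files); only the column name `Neighborhood` is borrowed from the question.

```python
import pandas as pd

# Toy data (hypothetical): the test frame lacks the 'c' level seen in training.
train = pd.DataFrame({'Neighborhood': ['a', 'b', 'c', 'a']})
test = pd.DataFrame({'Neighborhood': ['a', 'b', 'a']})

# Encode each frame separately, as the question does.
train_encoded = pd.get_dummies(train, columns=['Neighborhood'])
test_encoded = pd.get_dummies(test, columns=['Neighborhood'])

# Columns produced for train but not for test.
missing_in_test = sorted(set(train_encoded.columns) - set(test_encoded.columns))

print(train_encoded.shape, test_encoded.shape)
print(missing_in_test)
```

Running the same set difference on the real `X` and `X_uncoded_test` should list the seven columns that account for the 158 vs. 151 mismatch.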
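You could consider combining the train and test data before encoding, then splitting them back into train and test after encoding, as described here.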
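The combine-then-split idea could look roughly like the sketch below. The frames are toy stand-ins for the real data, and using `pd.concat` with `keys` is just one way to keep track of which rows came from where; it is an assumption of this sketch, not necessarily what the linked post does.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the train and test predictors.
train = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                      'Neighborhood': ['a', 'b', 'c']})
test = pd.DataFrame({'LotArea': [7000, 10000],
                     'Neighborhood': ['a', 'b']})

# Stack both frames; the keys become an outer index level used to split later.
combined = pd.concat([train, test], keys=['train', 'test'])

# Encode once, so both parts get a column for every level seen in either part.
combined_encoded = pd.get_dummies(combined, columns=['Neighborhood'])

# Split back into the two parts; each keeps its original row index.
X_train = combined_encoded.loc['train']
X_test = combined_encoded.loc['test']

print(X_train.shape, X_test.shape)
```

An alternative that avoids concatenating is to encode the two frames separately and then align the test columns to the train columns with `test_encoded.reindex(columns=train_encoded.columns, fill_value=0)`. Separately from the column mismatch, the question refits the scaler on the test set with `fit_transform`; calling `maxabsscaler.transform(X_test)` would reuse the scaling learned from the training data.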