I am trying to fit some linear regression models on a DataFrame of NBA player data to predict the target variable 'ORPM'. However, when the following code runs:
X = orpm_data.drop(['Player','Lg','ORPM'],axis=1)
y = orpm_data['ORPM']
linreg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('linreg', LinearRegression())])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=99)
linreg.fit(X_train, y_train)
the error:
ValueError: 'ORPM' is not in list
is raised. What am I missing?
Edit in response to comments:
print(X) prints the whole DataFrame, which is too large to show here, but X.info() returns the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603 entries, 0 to 602
Data columns (total 29 columns):
Tm 603 non-null object
Season 603 non-null object
Age 603 non-null int64
G 603 non-null int64
GS 603 non-null int64
MP 603 non-null int64
FG 603 non-null float64
FGA 603 non-null float64
2P 603 non-null float64
2PA 603 non-null float64
3P 603 non-null float64
3PA 603 non-null float64
FT 603 non-null float64
FTA 603 non-null float64
ORB 603 non-null float64
DRB 603 non-null float64
TRB 603 non-null float64
AST 603 non-null float64
STL 603 non-null float64
BLK 603 non-null float64
TOV 603 non-null float64
PF 603 non-null float64
PTS 603 non-null float64
FG% 600 non-null float64
2P% 600 non-null float64
3P% 517 non-null float64
eFG% 600 non-null float64
FT% 588 non-null float64
TS% 600 non-null float64
dtypes: float64(23), int64(4), object(2)
memory usage: 136.7+ KB
print(y) returns:
None
0 2.38
1 3.87
2 -1.21
3 1.58
4 -4.30
5 -0.62
...
598 -2.64
599 0.95
600 -2.98
601 -0.98
602 -2.08
Name: ORPM, Length: 603, dtype: float64
EDIT2: More details on the preprocessing pipeline
numeric_features = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA',
'2P', '2PA', '3P', '3PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST',
'STL', 'BLK', 'TOV', 'PF', 'PTS', 'FG%', '2P%', '3P%', 'eFG%', 'FT%',
'TS%','ORPM']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('PCA', PCA())])
categorical_features = ['Tm', 'Season']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
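The full error stack trace is here: https://github.com/aj-1000/debugging-regression-model/blob/master/README.md
The problem was solved by removing 'ORPM' from numeric_features: I had already dropped that column from X before passing the data to the pipeline, so the ColumnTransformer was being asked to select a column that X no longer contained.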
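For anyone hitting the same error, here is a minimal sketch of one way to keep the feature lists from drifting out of sync with X: derive them from X itself after the target has been dropped, rather than typing them by hand. This is not the code from my project. The small random DataFrame below is a hypothetical stand-in for orpm_data (only a handful of the real column names, made-up values), and selecting columns by dtype with select_dtypes is my own assumption about what counts as numeric versus categorical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for orpm_data: a few of the real column names, random values.
rng = np.random.default_rng(99)
n = 60
orpm_data = pd.DataFrame({
    'Player': [f'player_{i}' for i in range(n)],
    'Lg': ['NBA'] * n,
    'Tm': rng.choice(['BOS', 'LAL', 'MIA'], size=n),
    'Season': rng.choice(['2017-18', '2018-19'], size=n),
    'Age': rng.integers(19, 38, size=n),
    'PTS': rng.normal(10, 4, size=n),
    '3P%': rng.uniform(0.2, 0.45, size=n),
    'ORPM': rng.normal(0, 2, size=n),
})
orpm_data.loc[::7, '3P%'] = np.nan  # a few missing values, as in the X.info() output

X = orpm_data.drop(['Player', 'Lg', 'ORPM'], axis=1)
y = orpm_data['ORPM']

# Build the feature lists from X itself, so the dropped target cannot sneak back in.
numeric_features = X.select_dtypes(include='number').columns.tolist()
categorical_features = X.select_dtypes(exclude='number').columns.tolist()

# Cheap sanity check: every listed column must actually exist in X.
missing = set(numeric_features + categorical_features) - set(X.columns)
if missing:
    raise ValueError(f'Columns listed but not present in X: {sorted(missing)}')

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('PCA', PCA())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

linreg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('linreg', LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=99)
linreg.fit(X_train, y_train)
preds = linreg.predict(X_test)
```

The explicit check on missing is optional, but it fails with a message naming the offending columns before fit is ever called, which is easier to read than the traceback raised from inside the ColumnTransformer.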