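I'm working on a class project in which I'm trying to predict NFL game scores using linear regression and the predict function in sklearn. My problem arises when I try to pass the training data to the fit function. Here is my code: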
onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])
# Create the linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])
This is the structure of the dataframe (goal_model_data):
team opponent goals home
NE KC 27 1
BUF NYJ 21 1
CHI ATL 17 1
CIN BAL 0 1
CLE PIT 18 1
DET ARI 35 1
HOU JAX 7 1
TEN OAK 16 1
This is the error I get when I run the program:
Traceback (most recent call last):
File "predictnflgames.py", line 76, in <module>
regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2177, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1269, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['team' 'opponent'] not in index"
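Answer 0 (score: 2)
The problem is that after pd.get_dummies there are no team and opponent columns; they have been replaced by prefixed indicator columns such as team_NE and opponent_KC.
I used this data in txt format as my example: https://ufile.io/e2vtv (the same as yours).
Try this and see: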
import pandas as pd
from sklearn.linear_model import LinearRegression
goal_model_data = pd.read_table('goal_model_data.txt', delim_whitespace=True)
onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])
regr = LinearRegression()
#see the columns in onehotdata_x1
onehotdata_x1.columns
#see the data (only 2 rows of the data for the example)
onehotdata_x1.head(2)
Result:
Index([u'goals', u'home', u'team_BUF', u'team_CHI', u'team_CIN', u'team_CLE',
u'team_DET', u'team_HOU', u'team_NE', u'team_TEN', u'opponent_ARI',
u'opponent_ATL', u'opponent_BAL', u'opponent_JAX', u'opponent_KC',
u'opponent_NYJ', u'opponent_OAK', u'opponent_PIT'],
dtype='object')
goals home team_BUF team_CHI team_CIN team_CLE team_DET team_HOU \
0 27 1 0 0 0 0 0 0
1 21 1 1 0 0 0 0 0
team_NE team_TEN opponent_ARI opponent_ATL opponent_BAL opponent_JAX \
0 1 0 0 0 0 0
1 0 0 0 0 0 0
opponent_KC opponent_NYJ opponent_OAK opponent_PIT
0 1 0 0 0
1 0 1 0 0
Edit 1
Based on your original code, you probably want to do something like this:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_table('data.txt', delim_whitespace=True)
onehotdata = pd.get_dummies(data,columns=['team','opponent'])
regr = LinearRegression()
# in x, keep every column except the goals column
x = onehotdata.loc[:, onehotdata.columns != 'goals']
# use the goals column as the target variable
y = onehotdata['goals']
regr.fit(x,y)
regr.predict(x)
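To score a game that is not in the training data, the new row needs the same dummy columns, in the same order, as x. The following is only a sketch of one common way to do that and is not part of the original answer: the new_game fixture is hypothetical, the training rows are copied from the question's sample, and dtype=int is passed to get_dummies so the indicators are integers regardless of pandas version.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# A few rows copied from the question's sample data
data = pd.DataFrame({
    "team": ["NE", "BUF", "CHI", "CIN"],
    "opponent": ["KC", "NYJ", "ATL", "BAL"],
    "goals": [27, 21, 17, 0],
    "home": [1, 1, 1, 1],
})

onehotdata = pd.get_dummies(data, columns=["team", "opponent"], dtype=int)
x = onehotdata.drop(columns="goals")  # every column except the target
y = onehotdata["goals"]
regr = LinearRegression().fit(x, y)

# Hypothetical new fixture (an assumption for illustration)
new_game = pd.DataFrame({"team": ["NE"], "opponent": ["BAL"], "home": [1]})
new_x = pd.get_dummies(new_game, columns=["team", "opponent"], dtype=int)

# Align with the training columns; dummies missing from this row become 0
new_x = new_x.reindex(columns=x.columns, fill_value=0)

prediction = regr.predict(new_x)
print(prediction)
```

Note that reindex silently drops any dummy column for a team the model never saw during training, so such a row would be encoded as all zeros for that feature.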
Hope this helps.
Answer 1 (score: -1)
When you use pd.get_dummies(goal_model_data,columns=['team','opponent']), the team and opponent columns are removed from your dataframe, so onehotdata_x1 will not contain those two columns.
Then, when you run onehotdata_x1[['home','team','opponent']], you get a KeyError simply because the team and opponent columns do not exist in the onehotdata_x1 dataframe.
Using a toy dataframe, this is what happens:
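A minimal sketch of such a toy example follows. It is not the original answerer's code: the three-row dataframe is made up for illustration (values borrowed from the question's sample), and the names toy, encoded and feature_cols are hypothetical.

```python
import pandas as pd

# Toy data mirroring the question's column layout (illustrative only)
toy = pd.DataFrame({
    "team": ["NE", "BUF", "CHI"],
    "opponent": ["KC", "NYJ", "ATL"],
    "goals": [27, 21, 17],
    "home": [1, 1, 1],
})

encoded = pd.get_dummies(toy, columns=["team", "opponent"])

# 'team' and 'opponent' are gone; prefixed dummy columns took their place
print(list(encoded.columns))

# Selecting the old names therefore raises KeyError
try:
    encoded[["home", "team", "opponent"]]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Select the generated dummy columns by prefix instead
feature_cols = ["home"] + [
    c for c in encoded.columns if c.startswith(("team_", "opponent_"))
]
X = encoded[feature_cols]
print(X.shape)
```

Selecting by prefix avoids hard-coding every team name, so the same lines keep working when more teams appear in the data.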