Question

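I am working on a class project in which I am trying to predict NFL game scores using linear regression with the fit and predict functions in sklearn. My problem occurs when I try to pass the training data to the fit function. Here is my code: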

onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])

# Create the linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])

This is the structure of the dataframe (goal_model_data):

team opponent  goals  home
 NE       KC     27     1
BUF      NYJ     21     1
CHI      ATL     17     1
CIN      BAL      0     1
CLE      PIT     18     1
DET      ARI     35     1
HOU      JAX      7     1
TEN      OAK     16     1

This is the error I get when I run the program:

Traceback (most recent call last):
  File "predictnflgames.py", line 76, in <module>
    regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2133, in __getitem__
    return self._getitem_array(key)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2177, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1269, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['team' 'opponent'] not in index"

Answer 1

The problem is that the team and opponent columns no longer exist after pd.get_dummies.

I used this data in txt format as my example: https://ufile.io/e2vtv (the same as yours).

Try it and see:

import pandas as pd
from sklearn.linear_model import LinearRegression

# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas
goal_model_data = pd.read_csv('goal_model_data.txt', sep=r'\s+')

onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])

regr = LinearRegression()

#see the columns in onehotdata_x1
onehotdata_x1.columns

#see the data (only 2 rows of the data for the example)
onehotdata_x1.head(2)

Result:

Index([u'goals', u'home', u'team_BUF', u'team_CHI', u'team_CIN', u'team_CLE',
       u'team_DET', u'team_HOU', u'team_NE', u'team_TEN', u'opponent_ARI',
       u'opponent_ATL', u'opponent_BAL', u'opponent_JAX', u'opponent_KC',
       u'opponent_NYJ', u'opponent_OAK', u'opponent_PIT'],
       dtype='object')

   goals  home  team_BUF  team_CHI  team_CIN  team_CLE  team_DET  team_HOU  \
0     27     1         0         0         0         0         0         0
1     21     1         1         0         0         0         0         0

   team_NE  team_TEN  opponent_ARI  opponent_ATL  opponent_BAL  opponent_JAX  \
0        1         0             0             0             0             0
1        0         0             0             0             0             0

   opponent_KC  opponent_NYJ  opponent_OAK  opponent_PIT
0            1             0             0             0
1            0             1             0             0
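Given the column names in that listing, one possible way to repair the original selection is to pick 'home' plus every generated dummy column by its prefix. This is only a sketch: the four-row sample stands in for the full data, feature_cols is a name introduced here for illustration, and dtype=int is an assumption added to get 0/1 columns as in the output shown.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small sample mirroring the structure of goal_model_data from the question
goal_model_data = pd.DataFrame({
    'team': ['NE', 'BUF', 'CHI', 'CIN'],
    'opponent': ['KC', 'NYJ', 'ATL', 'BAL'],
    'goals': [27, 21, 17, 0],
    'home': [1, 1, 1, 1],
})

# dtype=int (an assumption, not in the original call) yields 0/1 integer dummies
onehotdata_x1 = pd.get_dummies(goal_model_data, columns=['team', 'opponent'], dtype=int)

# Select 'home' plus every dummy column generated from 'team' and 'opponent'
feature_cols = ['home'] + [
    col for col in onehotdata_x1.columns
    if col.startswith(('team_', 'opponent_'))
]

regr = LinearRegression()
regr.fit(onehotdata_x1[feature_cols], onehotdata_x1['goals'])
```

Selecting by prefix avoids hard-coding the team names, so the same lines should keep working when more teams appear in the data.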

Edit 1

Based on your original code, you probably want to do something like this:

import pandas as pd
from sklearn.linear_model import LinearRegression

# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas
data = pd.read_csv('data.txt', sep=r'\s+')

onehotdata = pd.get_dummies(data,columns=['team','opponent'])

regr = LinearRegression()

#in x get all columns except goals column
x = onehotdata.loc[:, onehotdata.columns != 'goals']

#use goals column as target variable
y= onehotdata['goals']

regr.fit(x,y)
regr.predict(x)
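To predict a game that is not in the training data, the new row needs exactly the same dummy columns as x. A rough sketch of one way to do that follows; the new_game row, the small sample frame, and the dtype=int argument are assumptions made for illustration and do not come from the original post.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small sample standing in for the contents of data.txt
data = pd.DataFrame({
    'team': ['NE', 'BUF', 'CHI', 'CIN'],
    'opponent': ['KC', 'NYJ', 'ATL', 'BAL'],
    'goals': [27, 21, 17, 0],
    'home': [1, 1, 1, 1],
})

onehotdata = pd.get_dummies(data, columns=['team', 'opponent'], dtype=int)
x = onehotdata.drop(columns='goals')
y = onehotdata['goals']
regr = LinearRegression().fit(x, y)

# Hypothetical new game: BUF playing away against KC
new_game = pd.DataFrame({'team': ['BUF'], 'opponent': ['KC'], 'home': [0]})
new_x = pd.get_dummies(new_game, columns=['team', 'opponent'], dtype=int)

# Align with the training columns; dummies absent from the new row become 0
new_x = new_x.reindex(columns=x.columns, fill_value=0)

predicted_goals = regr.predict(new_x)
```

Note that reindex also silently drops any team that was never seen during training, so such a row would be predicted with all team dummies set to 0.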

Hope this helps.

Answer 2

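When you use pd.get_dummies(goal_model_data,columns=['team','opponent']), the team and opponent columns are removed from your dataframe, so onehotdata_x1 will not contain those two columns.

Then, when you execute onehotdata_x1[['home','team','opponent']], you get a KeyError simply because team and opponent do not exist as columns in the onehotdata_x1 dataframe.

Using a toy dataframe, this is what happens: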
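A minimal sketch of that behavior, assuming a two-row frame built from the first rows of the question's data (the names toy and toy_dummies are made up for this illustration):

```python
import pandas as pd

# Two-row toy frame with the same columns as goal_model_data
toy = pd.DataFrame({
    'team': ['NE', 'BUF'],
    'opponent': ['KC', 'NYJ'],
    'goals': [27, 21],
    'home': [1, 1],
})

toy_dummies = pd.get_dummies(toy, columns=['team', 'opponent'])

# The original 'team' and 'opponent' columns are replaced by prefixed dummies
print(list(toy_dummies.columns))
# ['goals', 'home', 'team_BUF', 'team_NE', 'opponent_KC', 'opponent_NYJ']

# Selecting the old column names therefore fails
try:
    toy_dummies[['home', 'team', 'opponent']]
except KeyError as err:
    print('KeyError:', err)
```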

Fitting a dataframe to a linear regression in sklearn
