Pandas DataFrame vectorization/filtering: ValueError: Can only compare identically-labeled Series objects

Time: 2019-11-02 01:17:09

Tags: python pandas dataframe vectorization

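I have two DataFrames of NHL hockey statistics. One contains every game played by every team over the past decade, and the other is where I want to store the values I calculate. In short, I want to take a metric from a team's first five games, sum it, and put the result in the other DataFrame. I have trimmed my DataFrames below to leave out the other statistics and will look at only one statistic.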

df_all contains all the games:

>>> df_all
        season      gameId playerTeam opposingTeam  gameDate  xGoalsFor  xGoalsAgainst
1         2008  2008020001        NYR          T.B  20081004      2.287          2.689
6         2008  2008020003        NYR          T.B  20081005      1.793          0.916
11        2008  2008020010        NYR          CHI  20081010      1.938          2.762
16        2008  2008020019        NYR          PHI  20081011      3.030          3.020
21        2008  2008020034        NYR          N.J  20081013      1.562          3.454
...        ...         ...        ...          ...       ...        ...            ...
142576    2015  2015030185        L.A          S.J  20160422      2.927          2.042
142581    2017  2017030171        L.A          VGK  20180411      1.275          2.279
142586    2017  2017030172        L.A          VGK  20180413      1.907          4.642
142591    2017  2017030173        L.A          VGK  20180415      2.452          3.159
142596    2017  2017030174        L.A          VGK  20180417      2.427          1.818

df_sum_all will hold the calculated statistics; for now it has a set of empty columns:

>>> df_sum_all
     season team  xg5  xg10  xg15  xg20
0      2008  NYR    0     0     0     0
1      2009  NYR    0     0     0     0
2      2010  NYR    0     0     0     0
3      2011  NYR    0     0     0     0
4      2012  NYR    0     0     0     0
..      ...  ...  ...   ...   ...   ...
327    2014  L.A    0     0     0     0
328    2015  L.A    0     0     0     0
329    2016  L.A    0     0     0     0
330    2017  L.A    0     0     0     0
331    2018  L.A    0     0     0     0

Here is my function, which calculates the ratio of xGoalsFor to xGoalsAgainst.

def calcRatio(statfor, statagainst, games, season, team, statsdf):
    tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
    tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statagainst).sum())
    tempRatio = tempFor / tempAgainst
    return tempRatio

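I believe the logic is sound. I pass in the statistics I want the ratio of, the number of games to sum over, the season and team to match, and the DataFrame to take the statistics from. I have tested the pieces separately and know that the filtering and summing work correctly. Here is an example of the tempFor calculation run on its own: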

>>> statsdf = df_all
>>> team = 'TOR'
>>> season = 2015
>>> games = 3
>>> tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
>>> print(tempFor)
8.618

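See? It returns a value. However, I can't do the same thing across the whole DataFrame. What am I missing? I thought this would effectively work row by row, setting the 'xg5' column to the output of the calcRatio function, which filters df_all using that row's 'season' and 'team'.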

>>> df_sum_all['xg5'] = calcRatio('xGoalsFor','xGoalsAgainst',5,df_sum_all['season'], df_sum_all['team'], df_all)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in calcRatio
  File "/home/sebastian/.local/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 1142, in wrapper
    raise ValueError("Can only compare identically-labeled " "Series objects")
ValueError: Can only compare identically-labeled Series objects

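Cheers, and thanks for your help!

Update: I used iterrows() and it worked fine, so I must not understand vectorization very well. But it's the same function, so why does it work one way and not the other?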

>>> emptyseries = []
>>> for index, row in df_sum_all.iterrows():
...     emptyseries.append(calcRatio('xGoalsFor','xGoalsAgainst',5,row['season'],row['team'], df_all))
... 
>>> df_sum_all['xg5'] = emptyseries
__main__:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df_sum_all
     season team       xg5  xg10  xg15  xg20
0      2008  NYR  0.826260     0     0     0
1      2009  NYR  1.288390     0     0     0
2      2010  NYR  0.915942     0     0     0
3      2011  NYR  0.730498     0     0     0
4      2012  NYR  0.980744     0     0     0
..      ...  ...       ...   ...   ...   ...
327    2014  L.A  0.823998     0     0     0
328    2015  L.A  1.147412     0     0     0
329    2016  L.A  1.054947     0     0     0
330    2017  L.A  1.369005     0     0     0
331    2018  L.A  0.721411     0     0     0

[332 rows x 6 columns]

1 Answer:

Answer 0 (score: 1)

"ValueError: Can only compare identically-labeled Series objects"

tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statagainst).sum())

Variable inputs:

team: df_sum_all['team']
season: df_sum_all['season']
statsdf: df_all

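So in the expression (statsdf.playerTeam == team), the comparison is between a Series from df_all and a Series from df_sum_all. If the two Series do not have identical labels (the same index), you get the error above. With iterrows(), team and season are single values for each row, so each comparison is between a Series and a scalar and succeeds.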
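To make this concrete, here is a minimal sketch on made-up data. The toy frames, the snake_case helper calc_ratio, and the column names xg2 and xg2_grouped are assumptions for illustration, not the asker's real data. It first reproduces the error by passing whole columns, then shows two possible ways around it: calling the function once per row with DataFrame.apply(axis=1), and a grouped alternative that avoids the per-row call entirely.

```python
import pandas as pd

# Toy stand-ins for df_all and df_sum_all (values are invented for illustration).
df_all = pd.DataFrame({
    "season": [2008, 2008, 2008, 2009, 2009],
    "playerTeam": ["NYR", "NYR", "NYR", "NYR", "NYR"],
    "gameDate": [20081004, 20081005, 20081010, 20091002, 20091003],
    "xGoalsFor": [2.0, 1.0, 3.0, 4.0, 2.0],
    "xGoalsAgainst": [1.0, 2.0, 3.0, 2.0, 2.0],
})
df_sum_all = pd.DataFrame({"season": [2008, 2009], "team": ["NYR", "NYR"]})


def calc_ratio(statfor, statagainst, games, season, team, statsdf):
    """Ratio of two summed stats over a team's first `games` games of a season."""
    mask = (statsdf.playerTeam == team) & (statsdf.season == season)
    first_games = statsdf[mask].nsmallest(games, "gameDate")
    return first_games[statfor].sum() / first_games[statagainst].sum()


# 1) Passing whole columns: a 5-row Series is compared with a 2-row Series,
#    their indexes differ, and pandas raises the ValueError from the question.
try:
    calc_ratio("xGoalsFor", "xGoalsAgainst", 2,
               df_sum_all["season"], df_sum_all["team"], df_all)
    raised = False
except ValueError:
    raised = True

# 2) Row-wise call: each row supplies scalar season/team values,
#    which is what the iterrows() loop was doing.
df_sum_all["xg2"] = df_sum_all.apply(
    lambda row: calc_ratio("xGoalsFor", "xGoalsAgainst", 2,
                           row["season"], row["team"], df_all),
    axis=1,
)

# 3) Grouped alternative: take the first 2 games per (team, season),
#    sum them, and merge the ratio back onto the summary frame.
first = (df_all.sort_values("gameDate")
               .groupby(["playerTeam", "season"])
               .head(2))
sums = first.groupby(["playerTeam", "season"])[["xGoalsFor", "xGoalsAgainst"]].sum()
ratio = ((sums["xGoalsFor"] / sums["xGoalsAgainst"])
         .rename("xg2_grouped")
         .reset_index()
         .rename(columns={"playerTeam": "team"}))
merged = df_sum_all.merge(ratio, on=["team", "season"], how="left")
```

The apply version is still a Python-level loop, so it is mainly a tidier way to write the iterrows() approach. The grouped version is closer to what "vectorized" usually means in pandas, though it may order tied game dates differently from nsmallest.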
