Pandas: speeding up a slow function/operation on a DataFrame

Time: 2019-09-22 16:56:34

Tags: python pandas performance dataframe optimization

I previously asked a question (it was answered correctly here):

link

To summarize briefly, I have the following DataFrame:

+--------+---------+------------+
| winner |  loser  | tournament |
+--------+---------+------------+
| John   | Steve   |      A     |
+--------+---------+------------+
| Steve  | John    |      B     |
+--------+---------+------------+
| John   | Michael |      A     |
+--------+---------+------------+
| Steve  | John    |      A     |
+--------+---------+------------+
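
For anyone who wants to reproduce this, the table above can be built as a small DataFrame. This is only a sketch: the variable name `df` matches the code later in the question, and the row order is assumed to be the chronological order of the matches.

```python
import pandas as pd

# The four example matches, in the order shown in the table above.
df = pd.DataFrame({
    "winner": ["John", "Steve", "John", "Steve"],
    "loser": ["Steve", "John", "Michael", "John"],
    "tournament": ["A", "B", "A", "A"],
})

print(df)
```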

I would like to end up with this:

+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| winner |  loser  | tournament | winner wins | loser wins | winner losses | loser losses | winner win % | loser win % |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  John  |  Steve  |      A     |      0      |      0     |       0       |       0      | 0/(0+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  Steve |   John  |      B     |      0      |      0     |       0       |       0      | 0/(0+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  John  | Michael |      A     |      1      |      0     |       0       |       0      | 1/(1+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  Steve |   John  |      A     |      0      |      2     |       1       |       0      | 0/(0+1)      | 2/(2+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
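
To make the rule behind these numbers concrete: each count only looks at earlier matches in the same tournament. The sketch below checks the last row (Steve beats John in tournament A) by hand. The helper names `row`, `earlier` and `same` are mine and purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "winner": ["John", "Steve", "John", "Steve"],
    "loser": ["Steve", "John", "Michael", "John"],
    "tournament": ["A", "B", "A", "A"],
})

row = df.iloc[3]        # the match being annotated
earlier = df.iloc[:3]   # every match that came before it
# Only earlier matches from the same tournament count.
same = earlier[earlier["tournament"] == row["tournament"]]

winner_wins = int((same["winner"] == row["winner"]).sum())
winner_losses = int((same["loser"] == row["winner"]).sum())
loser_wins = int((same["winner"] == row["loser"]).sum())
loser_losses = int((same["loser"] == row["loser"]).sum())

print(winner_wins, winner_losses, loser_wins, loser_losses)  # 0 1 2 0
```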

One of the suggested solutions was this code:

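def win_los_percent(sdf):
    # sdf holds the matches of a single tournament, in their original row order.
    # Earlier wins of the winner: running count of rows per winner.
    sdf['winner wins'] = sdf.groupby('winner').cumcount()
    # Earlier losses of the winner: rescans the label slice 0..i for every row i (slow).
    sdf['winner losses'] = [(sdf.loc[0:i, 'loser'] == sdf.loc[i, 'winner']).sum() for i in sdf.index]
    # Earlier losses of the loser: running count of rows per loser.
    sdf['loser losses'] = sdf.groupby('loser').cumcount()
    # Earlier wins of the loser: rescans the label slice 0..i for every row i (slow).
    sdf['loser wins'] = [(sdf.loc[0:i, 'winner'] == sdf.loc[i, 'loser']).sum() for i in sdf.index]
    # Win percentages; 0/0 gives NaN for players with no earlier matches.
    sdf['winner win %'] = sdf['winner wins'] / (sdf['winner wins'] + sdf['winner losses'])
    sdf['loser win %'] = sdf['loser wins'] / (sdf['loser wins'] + sdf['loser losses'])
    return sdf

# Apply the function to each tournament separately.
ddf = df.groupby('tournament').apply(win_los_percent)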

This does give the correct calculations and answer. However, my DataFrame is large, and the code takes a long time (> 10 minutes) to run.

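Can anyone suggest a way to speed up this function? Generally speaking, I am no stranger to Pandas and numpy, but one solution I have read about is vectorization.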

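I can't see how to vectorize a function like this. Can someone point me in the right direction? I don't mind creating more columns for intermediate calculations, as long as the answer is correct and it finishes reasonably quickly.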
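
One direction that might work, sketched here under assumptions rather than as a tested solution for the real data: reshape the matches into a long format with one row per (match, player), then take a cumulative sum per (tournament, player) and subtract the current match. The function name `add_prior_records` and the intermediate columns (`match`, `player`, `won`, `lost`) are hypothetical names of mine, and the rows are assumed to already be in chronological order.

```python
import numpy as np
import pandas as pd

def add_prior_records(df):
    """Sketch: wins/losses each player had earlier in the same tournament."""
    n = len(df)
    tournament = df["tournament"].to_numpy()
    # Long format: labels 0..n-1 are the winners' rows,
    # labels n..2n-1 are the losers' rows.
    long = pd.DataFrame({
        "match": np.tile(np.arange(n), 2),
        "tournament": np.concatenate([tournament, tournament]),
        "player": np.concatenate([df["winner"].to_numpy(), df["loser"].to_numpy()]),
        "won": np.repeat([1, 0], n),
    })
    long["lost"] = 1 - long["won"]
    # Put the rows in match order so cumulative sums run chronologically.
    long = long.sort_values("match", kind="stable")
    grouped = long.groupby(["tournament", "player"], sort=False)
    wins_cum = grouped["won"].cumsum()
    losses_cum = grouped["lost"].cumsum()
    # Running total minus the current match = record before this match.
    long["wins_before"] = wins_cum - long["won"]
    long["losses_before"] = losses_cum - long["lost"]
    # Back to the original labels: winners' rows first, then losers' rows.
    long = long.sort_index()
    winners, losers = long.iloc[:n], long.iloc[n:]

    out = df.copy()
    out["winner wins"] = winners["wins_before"].to_numpy()
    out["loser wins"] = losers["wins_before"].to_numpy()
    out["winner losses"] = winners["losses_before"].to_numpy()
    out["loser losses"] = losers["losses_before"].to_numpy()
    # 0/0 becomes NaN for players with no earlier matches.
    out["winner win %"] = out["winner wins"] / (out["winner wins"] + out["winner losses"])
    out["loser win %"] = out["loser wins"] / (out["loser wins"] + out["loser losses"])
    return out

# The four example matches from the question.
df = pd.DataFrame({
    "winner": ["John", "Steve", "John", "Steve"],
    "loser": ["Steve", "John", "Michael", "John"],
    "tournament": ["A", "B", "A", "A"],
})
out = add_prior_records(df)
print(out)
```

On the four example rows this gives the same counts as the desired table, with NaN in the 0/(0+0) cells. It avoids the per-row rescans, but whether it is fast enough on the full data would still need to be measured.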

Thanks

0 answers:

No answers yet