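Is there a more efficient way to iterate over the rows of a pandas DataFrame?

Time: 2018-12-07 19:48:27

Tags: python pandas

I have a huge pandas DataFrame in which each row corresponds to one sports match. It looks like this:

**Edit:** I changed the example code to better reflect the actual data. This made me realize that outcome values other than 'lost' or 'win' make this more difficult.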

import pandas as pd

d = {'date': ['21.01.96', '22.02.96', '23.02.96', '24.02.96', '25.02.96',
          '26.02.96', '27.02.96', '28.02.96', '29.02.96', '30.02.96'], 
     'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
     'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
     'outcome': ['win', 'lost', 'declined', 'win', 'declined', 'win', 'declined', 'declined', 'lost', 'lost']
     }
df = pd.DataFrame(data=d)

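For each match, I want to compute new variables holding the number of previous wins/losses. In the example, the "prev_wins" variable would be [0,0,0,1,0,0,0,0,0,0]. I did manage to write working code for this, shown below: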

# `data` is the full data set; it has the same columns as `df` above
data = df.copy()

data['prev_wins_spec_challenger'] = 0
data['prev_losses_spec_challenger'] = 0

data['challenger'] = data['challenger'].astype(str)
data['opponent'] = data['opponent'].astype(str)

data['matchups'] = data['challenger'] + '-' + data['opponent']

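# create list of matchups with unique pairings (A-B and B-A count as one)
matchups_temp = list(data['matchups'].unique())
matchups = []
for match in matchups_temp:
    # swap the two IDs rather than reversing characters,
    # so that multi-digit IDs such as '5-10' are handled correctly
    reversed_match = '-'.join(reversed(match.split('-')))
    if reversed_match not in matchups:
        matchups.append(match)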

# running counters per matchup, all starting at zero
prev_wins = {m: 0 for m in matchups}
prev_losses = {m: 0 for m in matchups}

# go through data set for each matchup and calculate variables
# (this nested loop is the slow part: every matchup rescans every row)
for matchup in matchups:
    challenger, opponent = matchup.split('-')
    for index, row in data.iterrows():
        if row['challenger'] == challenger and row['opponent'] == opponent:
            if row['outcome'] == 'win':
                data.loc[index, 'prev_wins_spec_challenger'] = prev_wins[matchup]
                prev_wins[matchup] += 1
            elif row['outcome'] == 'lost':
                data.loc[index, 'prev_losses_spec_challenger'] = prev_losses[matchup]
                prev_losses[matchup] += 1
        elif row['challenger'] == opponent and row['opponent'] == challenger:
            if row['outcome'] == 'win':
                data.loc[index, 'prev_losses_spec_challenger'] = prev_losses[matchup]
                prev_losses[matchup] += 1
            elif row['outcome'] == 'lost':
                data.loc[index, 'prev_wins_spec_challenger'] = prev_wins[matchup]
                prev_wins[matchup] += 1

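The problem is that this takes an extremely long time, because there are ~65,000 distinct matchups and the DataFrame has ~170,000 rows. On my laptop this would take about 180 hours to run, which is unacceptable.

I'm sure there is a better solution, but even after a whole day of searching the internet I haven't been able to find one. How can I make this code faster?

2 Answers:

Answer 0: (score: 2)

IIUC, use groupby with cumsum: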

df['outcome'] = df.outcome.map({'win':1, 'loss':0})

Then:

df.groupby('challenger').outcome.cumsum().sub(1).clip(lower=0)

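Of course, you don't need to overwrite the values in outcome (you can create a new column and work with that instead). But in pandas, operations on ints are generally much faster than operations on strings. So from a performance standpoint, it is better to have 0 and 1 represent losses and wins than the actual words loss and win.

Only at the final layer, when you present the information, do you map back to words a human can understand. The processing itself usually doesn't need strings.

Answer 1: (score: 0)

IIUC, you can do something like the following, using shift() to look at the previous outcomes and taking the cumulative sum of the booleans that equal win: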
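As a rough illustration of the "create a new column" suggestion, here is a minimal sketch against the edited sample data from the question. The column names is_win and prev_wins are made up for this example, and treating both 'lost' and 'declined' as 0 is an assumption (a plain map of 'win'/'loss' would turn those values into NaN). It also assumes the rows are already in chronological order.

```python
import pandas as pd

# Sample data as in the edited question (dates omitted for brevity)
df = pd.DataFrame({
    'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    'outcome': ['win', 'lost', 'declined', 'win', 'declined',
                'win', 'declined', 'declined', 'lost', 'lost'],
})

# Hypothetical helper column: 1 where the challenger won, 0 otherwise.
# Assumption: 'lost' and 'declined' both count as "not a win".
df['is_win'] = df['outcome'].eq('win').astype(int)

# Running total of wins per challenger, minus the current row's own value,
# which leaves the number of wins strictly before each match.
df['prev_wins'] = df.groupby('challenger')['is_win'].cumsum() - df['is_win']

print(df[['challenger', 'opponent', 'outcome', 'prev_wins']])
```

The original outcome strings stay untouched, so they can still be shown to a reader later, while the grouping and summing only ever touch the integer column.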

data['previous_wins'] = data.groupby('challenger').outcome.transform(lambda x: x.shift().eq('win').cumsum())

>>> data
   challenger      date  opponent outcome  previous_wins
0           5  21.01.96         6     win              0
1           4  22.02.96         3    loss              0
2           5  23.02.96         6     win              1

If you want to count the challenger's wins against a specific opponent, you can group by both challenger and opponent:

data['previous_wins'] = data.groupby(['opponent','challenger']).outcome.transform(lambda x: x.shift().eq('win').cumsum())
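The loop in the question treats A-B and B-A as the same matchup, which grouping by ['opponent', 'challenger'] does not do. The following is only a sketch of one way to handle that without iterating over rows, under my own reading of the goal: for each row, count the earlier wins and losses of that row's challenger against that row's opponent, regardless of who challenged whom. The names lo, hi, prev_wins and prev_losses are invented for this example, 'declined' is assumed to count as neither a win nor a loss, and the rows are assumed to be in chronological order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    'outcome': ['win', 'lost', 'declined', 'win', 'declined',
                'win', 'declined', 'declined', 'lost', 'lost'],
})

# Order-independent pair key: smaller ID and larger ID of the two players
lo = df[['challenger', 'opponent']].min(axis=1)
hi = df[['challenger', 'opponent']].max(axis=1)

chal_is_lo = df['challenger'].eq(lo)
chal_won = df['outcome'].eq('win')
chal_lost = df['outcome'].eq('lost')

# Re-express each result from the point of view of the pair, not the challenger
lo_won = ((chal_is_lo & chal_won) | (~chal_is_lo & chal_lost)).astype(int)
hi_won = ((chal_is_lo & chal_lost) | (~chal_is_lo & chal_won)).astype(int)

# Exclusive running totals within each pair (wins before the current row)
lo_prev = lo_won.groupby([lo, hi]).cumsum() - lo_won
hi_prev = hi_won.groupby([lo, hi]).cumsum() - hi_won

# Map the pair-level counts back to the current row's challenger
df['prev_wins'] = np.where(chal_is_lo, lo_prev, hi_prev)
df['prev_losses'] = np.where(chal_is_lo, hi_prev, lo_prev)

print(df)
```

Everything here is a vectorized column operation plus two grouped cumulative sums, so the cost grows roughly with the number of rows rather than with rows times matchups. Whether these counts match the intended definition of the two columns in the question is something to check against a few hand-computed rows first.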