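I have a pandas DataFrame containing data about sports matches. Think of the columns as winner_id, loser_id and match_id. For each match, I am trying to find the index of the most recent earlier match played by the current winner. The expected DataFrame looks like this: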
import pandas as pd

d = {'winner': ["A","B","C","A","A","C","B","D"], 'loser': ["B","C","D","D","D","B","A","C"], 'id': [1,2,3,4,5,6,7,8], 'id_of_last_winner': ["", 0, 1, 0, 3, 2, 5, 4]}
df = pd.DataFrame(d)
df
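If I iterate over the columns, performance is poor. What I expect my code to produce for id_of_last_winner:
and so on...
So my first intuition was to loop over the winner column with a for loop and compare the current element against the other elements in both the loser and winner columns. It is simple, but it performs badly because every iteration contains two further iterations. Is there a better way to speed this up?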
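I was hopeful because I found
df['id_of_last_winner'] = df.groupby('winner')['id'].shift()
But this only looks at the winner column, so matches the player lost are ignored. Any better ideas? Thanks in advance!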
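To make the limitation concrete, here is a small sketch of what the groupby attempt returns on the sample data from the question. The variable name prev_win is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "winner": ["A", "B", "C", "A", "A", "C", "B", "D"],
    "loser":  ["B", "C", "D", "D", "D", "B", "A", "C"],
    "id":     [1, 2, 3, 4, 5, 6, 7, 8],
})

# Within each group of rows sharing a winner, shift the id down by one,
# i.e. the id of the previous match that the same player WON.
prev_win = df.groupby("winner")["id"].shift()

print(prev_win.tolist())
# [nan, nan, nan, 1.0, 4.0, 3.0, 2.0, nan]
```

Row 1 (winner B) comes out as NaN even though B already played match 1 as the loser, which is why grouping on a single column is not enough here.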
Answer 0 (score: 1)
It is a little confusing that you ask for id, while the expected output uses the index. Here is an example using id:
# create a list of all players
players = list(set(df.winner).union(set(df.loser)))
# for each player, add a column holding the id of that player's previous game
for player in players:
    df[player] = (df.id.where((df.winner == player) | (df.loser == player))
                    .ffill().shift())
# here's our result: for each row, read the column that belongs to that row's winner
df['winner_last_game'] = df.apply(lambda r: r[r.winner], axis=1)
Obviously this does not scale to a very large number of players, but it should be fine for a few hundred. Here is the output:
+---+----+--------+-------+-------------------+-----+-----+-----+-----+------------------+
| | id | winner | loser | id_of_last_winner | A | C | D | B | winner_last_game |
+---+----+--------+-------+-------------------+-----+-----+-----+-----+------------------+
| 0 | 1 | A | B | | NaN | NaN | NaN | NaN | NaN |
| 1 | 2 | B | C | 0 | 1.0 | NaN | NaN | 1.0 | 1.0 |
| 2 | 3 | C | D | 1 | 1.0 | 2.0 | NaN | 2.0 | 2.0 |
| 3 | 4 | A | D | 0 | 1.0 | 3.0 | 3.0 | 2.0 | 1.0 |
| 4 | 5 | A | D | 3 | 4.0 | 3.0 | 4.0 | 2.0 | 4.0 |
| 5 | 6 | C | B | 2 | 5.0 | 3.0 | 5.0 | 2.0 | 3.0 |
| 6 | 7 | B | A | 5 | 5.0 | 6.0 | 5.0 | 6.0 | 6.0 |
| 7 | 8 | D | C | 4 | 7.0 | 6.0 | 5.0 | 7.0 | 5.0 |
+---+----+--------+-------+-------------------+-----+-----+-----+-----+------------------+
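If the number of players is large, one possible alternative (not part of the answer above, just a sketch) is to reshape the data into one row per player per match and shift within each player, which avoids creating a column per player. It assumes that id is unique and increases in chronological order, and that a player never faces themselves. The names long, prev_id and prev_for_winner are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "winner": ["A", "B", "C", "A", "A", "C", "B", "D"],
    "loser":  ["B", "C", "D", "D", "D", "B", "A", "C"],
    "id":     [1, 2, 3, 4, 5, 6, 7, 8],
})

# Long format: one row per (match, player), remembering the player's role.
long = df.melt(id_vars=["id"], value_vars=["winner", "loser"],
               var_name="role", value_name="player")

# Assumption: sorting by id puts each player's matches in chronological order.
long = long.sort_values("id")

# Previous match id for each player, regardless of whether they won or lost it.
long["prev_id"] = long.groupby("player")["id"].shift()

# Keep only the winner rows and map the result back onto the original frame by id.
prev_for_winner = long.loc[long["role"] == "winner"].set_index("id")["prev_id"]
df["winner_last_game"] = df["id"].map(prev_for_winner)

print(df["winner_last_game"].tolist())
# [nan, 1.0, 2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
```

This gives the same winner_last_game values as the table above. To get row positions instead of ids, the same steps should work on a helper column holding the row index in place of id.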