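I have a Pandas DataFrame containing player data for a sport across several years. Note that a player can play in more than one league during the same season. Here is a sample of the DataFrame: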
import pandas as pd
from io import StringIO
s = '''\
PlayerName,Year,League,Points
Player1,2010,LeagueA,10
Player1,2010,LeagueB,20
Player1,2011,LeagueC,30
'''
df = pd.read_csv(StringIO(s))
It looks like this:
  PlayerName  Year   League  Points
0    Player1  2010  LeagueA      10
1    Player1  2010  LeagueB      20
2    Player1  2011  LeagueC      30
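Now I want to create a new DataFrame, or reshape the existing one, so that it holds pairwise comparisons of the leagues each player has played in. The two rows in a pair must come from the same year or from years at most one apart, and no pair may appear more than once. For example, the DataFrame I want to end up with looks like this: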
Player Name  Year 1  League 1  Points 1  Year 2  League 2  Points 2
Player 1     2010    League A  10        2010    League B  20
Player 1     2010    League A  10        2011    League C  30
Player 1     2010    League B  20        2011    League C  30
My current idea for this is:
# Uses the `df` built above (columns: PlayerName, Year, League, Points)
# First and last row for each player/year combination
df1 = df.drop_duplicates(subset=['PlayerName', 'Year'], keep='first')
df2 = df.drop_duplicates(subset=['PlayerName', 'Year'], keep='last')
# Join each row to those rows by player; shared columns get the _x / _y suffixes
merged_df1 = df.merge(df1, on='PlayerName')
merged_df2 = df.merge(df2, on='PlayerName')
combined_df = pd.concat([merged_df1, merged_df2])
combined_df = combined_df.drop_duplicates(subset='PlayerName', keep='first')
# Year gap between the two sides of each pair
combined_df['Year Difference'] = combined_df['Year_x'] - combined_df['Year_y']
combined_df = combined_df.loc[(combined_df['Year Difference'] >= -1) & (combined_df['Year Difference'] <= 1)]
Is there a better way to do this? The code feels quite bulky and error-prone to me. Any help would be greatly appreciated.
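For reference, one possible way to express the pairing rule described above is a self-merge on the player, filtered on row order and year gap. This is only a minimal sketch under a few assumptions: the column names are those of the sample CSV (`PlayerName`, `Year`, `League`, `Points`), and the function name `pair_leagues`, its `max_gap` parameter, and the temporary `row` column are hypothetical names introduced for illustration.

```python
import pandas as pd
from io import StringIO

s = '''\
PlayerName,Year,League,Points
Player1,2010,LeagueA,10
Player1,2010,LeagueB,20
Player1,2011,LeagueC,30
'''
df = pd.read_csv(StringIO(s))


def pair_leagues(df, max_gap=1):
    """Pair each player's rows with each other (hypothetical helper).

    Keeps every unordered pair once and only when the two years
    differ by at most `max_gap`.
    """
    # Store each row's position in a 'row' column so pairs can be ordered.
    left = df.reset_index(drop=True).rename_axis('row').reset_index()
    # Self-join on the player: every row meets every row of the same player.
    pairs = left.merge(left, on='PlayerName', suffixes=('_1', '_2'))
    # row_1 < row_2 drops self-pairs and mirrored duplicates (A-B vs. B-A).
    is_unique_pair = pairs['row_1'] < pairs['row_2']
    is_close = (pairs['Year_1'] - pairs['Year_2']).abs() <= max_gap
    result = pairs.loc[is_unique_pair & is_close]
    return result.drop(columns=['row_1', 'row_2']).reset_index(drop=True)


result = pair_leagues(df)
print(result)
```

On the sample data this should give the three pairs from the desired output, with columns `PlayerName`, `Year_1`, `League_1`, `Points_1`, `Year_2`, `League_2`, `Points_2`. The self-merge grows quadratically with the number of rows per player, so this sketch may need rethinking for players with very long histories.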