I am using code to shift time-series data that looks similar to the following:
Year  Player           PTSN    AVGN
2018  Aaron Donald     280.60  17.538
2018  J.J. Watt        259.80  16.238
2018  Danielle Hunter  237.60  14.850
2017  Aaron Donald     181.0   12.929
2016  Danielle Hunter  204.6   12.788
The goal is to turn it into this:
                        AVGN   PTSN  AVGN_prev  PTSN_prev
Player          Year
Aaron Donald    2016     NaN    NaN        NaN        NaN
                2017  12.929  181.0        NaN        NaN
                2018  17.538  280.6     12.929      181.0
Danielle Hunter 2016  12.788  204.6        NaN        NaN
                2017   8.325  133.2     12.788      204.6
                2018  14.850  237.6      8.325      133.2
J.J. Watt       2016     NaN    NaN        NaN        NaN
                2017     NaN    NaN        NaN        NaN
                2018  16.238  259.8        NaN        NaN
I am using the following code to achieve this:
res = df.set_index(['player', 'Year'])
idx = pd.MultiIndex.from_product([df['player'].unique(),
                                  df['Year'].unique()],
                                 names=['Player', 'Year'])
res = res.groupby(['player', 'Year']).apply(sum)
res = res.reindex(idx).sort_index()
res[columns] = res.groupby('Player')[list(res.columns)].shift(1)
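I added groupby.sum() because some players in the dataframe moved from one team to another within the same season, and I want to combine those numbers. However, the data I get back is badly wrong. There are too many columns to post, but the previous-year (_prev) data appears to end up in seemingly random columns. The result is stable: the values always land in the same wrong columns. Is this caused by groupby.sum()? Or is it because I use the columns variable (which holds the same names as res.columns with the string '_prev' appended) together with list(res.columns)? Whatever the cause, how do I fix it?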
Here is the output of columns and res.columns:
columns:
['player_id_prev', 'position_prev', 'player_game_count_prev', 'team_name_prev', 'snap_counts_total_prev', 'snap_counts_pass_rush_prev', 'snap_counts_run_defense_prev', 'snap_counts_coverage_prev', 'grades_defense_prev', 'grades_run_defense_prev', 'grades_tackle_prev', 'grades_pass_rush_defense_prev', 'grades_coverage_defense_prev', 'total_pressures_prev', 'sacks_prev', 'hits_prev', 'hurries_prev', 'batted_passes_prev', 'tackles_prev', 'assists_prev', 'missed_tackles_prev', 'stops_prev', 'forced_fumbles_prev', 'targets_prev', 'receptions_prev', 'yards_prev', 'yards_per_reception_prev', 'yards_after_catch_prev', 'longest_prev', 'touchdowns_prev', 'interceptions_prev', 'pass_break_ups_prev', 'qb_rating_against_prev', 'penalties_prev', 'declined_penalties_prev']
res.columns:
['player_id', 'position', 'player_game_count', 'team_name',
'snap_counts_total', 'snap_counts_pass_rush', 'snap_counts_run_defense',
'snap_counts_coverage', 'grades_defense', 'grades_run_defense',
'grades_tackle', 'grades_pass_rush_defense', 'grades_coverage_defense',
'total_pressures', 'sacks', 'hits', 'hurries', 'batted_passes',
'tackles', 'assists', 'missed_tackles', 'stops', 'forced_fumbles',
'targets', 'receptions', 'yards', 'yards_per_reception',
'yards_after_catch', 'longest', 'touchdowns', 'interceptions',
'pass_break_ups', 'qb_rating_against', 'penalties',
'declined_penalties']
Both have length 35 when tested.
Answer 0 (score: 2)
I suggest using:
import pandas as pd

# first aggregate so that the (Player, Year) MultiIndex is unique
res = df.groupby(['Player', 'Year']).sum()
# build the full MultiIndex of every Player x Year combination
idx = pd.MultiIndex.from_product(res.index.levels,
                                 names=['Player', 'Year'])
# add the missing years as NaN rows
res = res.reindex(idx).sort_index()
# shift all columns within each player, add a suffix, and join to the original
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
print(res)
                       PTSN    AVGN  PTSN_prev  AVGN_prev
Player          Year
Aaron Donald    2016    NaN     NaN        NaN        NaN
                2017  181.0  12.929        NaN        NaN
                2018  280.6  17.538      181.0     12.929
Danielle Hunter 2016  204.6  12.788        NaN        NaN
                2017    NaN     NaN      204.6     12.788
                2018  237.6  14.850        NaN        NaN
J.J. Watt       2016    NaN     NaN        NaN        NaN
                2017    NaN     NaN        NaN        NaN
                2018  259.8  16.238        NaN        NaN
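For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same steps. It rebuilds only the five sample rows from the question, so the column names Player, Year, PTSN and AVGN are taken from that sample rather than from the real 35-column data. The intermediate variable prev is introduced here purely for readability. The point it illustrates is that add_suffix derives the new names from the shifted columns themselves, so no separately maintained list of '_prev' names has to be kept in the same order as res.columns.

```python
import pandas as pd

# Sample rows copied from the question (not the real 35-column data).
df = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2017, 2016],
    "Player": ["Aaron Donald", "J.J. Watt", "Danielle Hunter",
               "Aaron Donald", "Danielle Hunter"],
    "PTSN": [280.60, 259.80, 237.60, 181.0, 204.6],
    "AVGN": [17.538, 16.238, 14.850, 12.929, 12.788],
})

# 1. Aggregate so that each (Player, Year) pair appears exactly once.
res = df.groupby(["Player", "Year"]).sum()

# 2. Reindex to every Player x Year combination; missing seasons become NaN rows.
idx = pd.MultiIndex.from_product(res.index.levels, names=["Player", "Year"])
res = res.reindex(idx).sort_index()

# 3. Shift by one season within each player. The "_prev" names are derived
#    from the shifted columns, so each value keeps its matching label.
prev = res.groupby(level="Player").shift().add_suffix("_prev")  # "prev" is an illustrative name
res = res.join(prev)

print(res)
```

With the real data, which also contains text columns such as position and team_name, it may be necessary to restrict the aggregation to numeric columns, for example with sum(numeric_only=True). That is an assumption about the full dataset and is not something shown in the question.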