Dataframe shifts data into random columns?

Posted: 2019-03-19 06:10:05

Tags: python pandas

I am using code to shift time-series data that looks like the following:

Year    Player          PTSN    AVGN                               
2018    Aaron Donald    280.60  17.538  
2018    J.J. Watt       259.80  16.238  
2018    Danielle Hunter 237.60  14.850  
2017    Aaron Donald    181.0   12.929  
2016    Danielle Hunter 204.6   12.788

with the goal of turning it into this:

                        AVGN   PTSN  AVGN_prev  PTSN_prev
Player          Year                                     
Aaron Donald    2016     NaN    NaN        NaN        NaN
                2017  12.929  181.0        NaN        NaN
                2018  17.538  280.6     12.929      181.0
Danielle Hunter 2016  12.788  204.6        NaN        NaN
                2017   8.325  133.2     12.788      204.6
                2018  14.850  237.6      8.325      133.2
J.J. Watt       2016     NaN    NaN        NaN        NaN
                2017     NaN    NaN        NaN        NaN
                2018  16.238  259.8        NaN        NaN

I am using the following code to achieve this:

res = df.set_index(['player', 'Year'])

idx = pd.MultiIndex.from_product([df['player'].unique(),
                                  df['Year'].unique()],
                                 names=['Player', 'Year'])

res = res.groupby(['player', 'Year']).apply(sum)

res = res.reindex(idx).sort_index()
#`columns` is a list of the res.columns names with '_prev' appended (both lists shown below)
res[columns] = res.groupby('Player')[list(res.columns)].shift(1)

I added the groupby.sum() because some players in the dataframe moved from one team to another within the same season, and I want to combine those numbers. However, the data I end up with is badly wrong. The frame has too many columns to post here, but the previous-year (_prev) values appear to be placed into random columns. The misplacement is consistent: the values always land in the same wrong columns. Is this a problem caused by the groupby.sum()? Is it because I assign using the columns variable (which contains the same names as res.columns, each with '_prev' appended) on the left and list(res.columns) on the right? Whatever the cause, how can I fix it?

Here is the output of columns and res.columns:

columns:

['player_id_prev', 'position_prev', 'player_game_count_prev', 'team_name_prev', 'snap_counts_total_prev', 'snap_counts_pass_rush_prev', 'snap_counts_run_defense_prev', 'snap_counts_coverage_prev', 'grades_defense_prev', 'grades_run_defense_prev', 'grades_tackle_prev', 'grades_pass_rush_defense_prev', 'grades_coverage_defense_prev', 'total_pressures_prev', 'sacks_prev', 'hits_prev', 'hurries_prev', 'batted_passes_prev', 'tackles_prev', 'assists_prev', 'missed_tackles_prev', 'stops_prev', 'forced_fumbles_prev', 'targets_prev', 'receptions_prev', 'yards_prev', 'yards_per_reception_prev', 'yards_after_catch_prev', 'longest_prev', 'touchdowns_prev', 'interceptions_prev', 'pass_break_ups_prev', 'qb_rating_against_prev', 'penalties_prev', 'declined_penalties_prev']

res.columns:

['player_id', 'position', 'player_game_count', 'team_name',
       'snap_counts_total', 'snap_counts_pass_rush', 'snap_counts_run_defense',
       'snap_counts_coverage', 'grades_defense', 'grades_run_defense',
       'grades_tackle', 'grades_pass_rush_defense', 'grades_coverage_defense',
       'total_pressures', 'sacks', 'hits', 'hurries', 'batted_passes',
       'tackles', 'assists', 'missed_tackles', 'stops', 'forced_fumbles',
       'targets', 'receptions', 'yards', 'yards_per_reception',
       'yards_after_catch', 'longest', 'touchdowns', 'interceptions',
       'pass_break_ups', 'qb_rating_against', 'penalties',
       'declined_penalties']

Both have length 35 when I check them.
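To go beyond comparing lengths, the two lists can also be compared position by position (a quick sketch under the assumption that the problem is an ordering mismatch; columns and res are the objects from the code above):

expected = [c + '_prev' for c in res.columns]
print(columns == expected)                                              # True only if every position matches
print([pair for pair in zip(columns, expected) if pair[0] != pair[1]])  # mismatched pairs, if any

If any pair prints in the second line, the left-hand '_prev' names and the right-hand shifted values do not line up position by position.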

1 answer:

Answer 0 (score: 2)

I suggest using:

#first aggregate for unique MultiIndex
res = df.groupby(['Player', 'Year']).sum()

#build the full MultiIndex
idx = pd.MultiIndex.from_product(res.index.levels,
                                 names=['Player', 'Year'])
#add the missing years
res = res.reindex(idx).sort_index()

#shift all columns, add suffix and join to original
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
print (res)
                       PTSN    AVGN  PTSN_prev  AVGN_prev
Player          Year                                     
Aaron Donald    2016    NaN     NaN        NaN        NaN
                2017  181.0  12.929        NaN        NaN
                2018  280.6  17.538      181.0     12.929
Danielle Hunter 2016  204.6  12.788        NaN        NaN
                2017    NaN     NaN      204.6     12.788
                2018  237.6  14.850        NaN        NaN
J.J. Watt       2016    NaN     NaN        NaN        NaN
                2017    NaN     NaN        NaN        NaN
                2018  259.8  16.238        NaN        NaN
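
As a side note, if only a handful of the 35 stat columns actually need a lagged copy, the same shift/add_suffix/join pattern can be applied to a column subset instead of the whole frame (a sketch reusing a few column names from the question; adjust the list as needed):

#shift only the selected stat columns; everything else stays untouched
lag_cols = ['sacks', 'tackles', 'total_pressures']
res = res.join(res.groupby('Player')[lag_cols].shift().add_suffix('_prev'))

A deeper lag works the same way, e.g. res.groupby('Player').shift(2).add_suffix('_prev2').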