I have a DataFrame that looks like this:
                      #  Year  Player           PTSN    AVGN
ThisYear
2018Aaron Donald      1  2018  Aaron Donald     280.60  17.538
2018J.J. Watt         2  2018  J.J. Watt        259.80  16.238
2018Danielle Hunter   3  2018  Danielle Hunter  237.60  14.850
2017Aaron Donald      8  2017  Aaron Donald     181.0   12.929
2017Danielle Hunter  20  2017  Danielle Hunter  133.2    8.325
2016Danielle Hunter   2  2016  Danielle Hunter  204.6   12.788
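I want to rearrange the rows and columns so that I can run a regression comparing each player's previous year with the following year. (If you know a better way to do what I'm asking, please let me know.)
The end result I'm looking for would be something like this: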
Player        PTSN    AVGN    PTSNN1  AVGNN1
Aaron Donald  280.60  17.538  181.0   12.929
How can I do this? Or is there a better way to get the result I want?
Answer 0 (score: 2)
Keep the data in long format and shift each player's values down by one year:
import pandas as pd

# Simplified version of the example DataFrame
df = pd.DataFrame({'#': [1, 2, 3, 8, 20, 2],
                   'AVGN': [17.538, 16.238, 14.85, 12.929, 8.325, 12.788],
                   'PTSN': [280.6, 259.8, 237.6, 181.0, 133.2, 204.6],
                   'Player': ['Aaron Donald',
                              'J.J. Watt',
                              'Danielle Hunter',
                              'Aaron Donald',
                              'Danielle Hunter',
                              'Danielle Hunter'],
                   'Year': [2018, 2018, 2018, 2017, 2017, 2016]})
# Do not unstack the MultiIndex
res = df.set_index(['Player', 'Year'])[['AVGN', 'PTSN']]
# Build a MultiIndex of all players by all years
idx = pd.MultiIndex.from_product([df['Player'].unique(),
                                  df['Year'].unique()],
                                 names=['Player', 'Year'])
# Introduce a row of NaN values for any combination of
# player and year not in the original DataFrame
res = res.reindex(idx).sort_index()
res[['AVGN_prev', 'PTSN_prev']] = res.groupby('Player')[['AVGN', 'PTSN']].shift()
res
                        AVGN   PTSN  AVGN_prev  PTSN_prev
Player          Year
Aaron Donald    2016     NaN    NaN        NaN        NaN
                2017  12.929  181.0        NaN        NaN
                2018  17.538  280.6     12.929      181.0
Danielle Hunter 2016  12.788  204.6        NaN        NaN
                2017   8.325  133.2     12.788      204.6
                2018  14.850  237.6      8.325      133.2
J.J. Watt       2016     NaN    NaN        NaN        NaN
                2017     NaN    NaN        NaN        NaN
                2018  16.238  259.8        NaN        NaN
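To get from this shifted table to the layout requested in the question, one option is to keep only the rows that have a prior-year value. The following is a minimal sketch and not part of the original answer: the column names PTSNN1 and AVGNN1 are taken from the question, while `prev` and `pairs` are illustrative names chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['Aaron Donald', 'J.J. Watt', 'Danielle Hunter',
               'Aaron Donald', 'Danielle Hunter', 'Danielle Hunter'],
    'Year': [2018, 2018, 2018, 2017, 2017, 2016],
    'PTSN': [280.6, 259.8, 237.6, 181.0, 133.2, 204.6],
    'AVGN': [17.538, 16.238, 14.85, 12.929, 8.325, 12.788],
})

# Rebuild the full player-by-year grid so that a one-row shift
# always corresponds to exactly one year
res = df.set_index(['Player', 'Year'])[['PTSN', 'AVGN']]
idx = pd.MultiIndex.from_product([df['Player'].unique(), df['Year'].unique()],
                                 names=['Player', 'Year'])
res = res.reindex(idx).sort_index()

# Prior-year values, named as in the question (PTSNN1, AVGNN1)
prev = res.groupby(level='Player')[['PTSN', 'AVGN']].shift()
prev.columns = ['PTSNN1', 'AVGNN1']

# Keep only player-years where both the current and the prior year exist
pairs = res.join(prev).dropna().reset_index()
print(pairs)
```

Each remaining row pairs one season with the season before it, so a player with several consecutive seasons contributes several rows.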
Alternatively, to get one row per player, set the index to ['Player', 'Year'] and then unstack the inner level (Year) into columns:
# Simplified version of your example DataFrame
df = pd.DataFrame({'#': [1, 2, 3, 8, 20, 2],
                   'AVGN': [17.538, 16.238, 14.85, 12.929, 8.325, 12.788],
                   'PTSN': [280.6, 259.8, 237.6, 181.0, 133.2, 204.6],
                   'Player': ['Aaron Donald',
                              'J.J. Watt',
                              'Danielle Hunter',
                              'Aaron Donald',
                              'Danielle Hunter',
                              'Danielle Hunter'],
                   'Year': [2018, 2018, 2018, 2017, 2017, 2016]})
res = df.set_index(['Player', 'Year'])[['AVGN', 'PTSN']].unstack()
res
                   AVGN                   PTSN
Year               2016    2017    2018   2016   2017   2018
Player
Aaron Donald        NaN  12.929  17.538    NaN  181.0  280.6
Danielle Hunter  12.788   8.325  14.850  204.6  133.2  237.6
J.J. Watt           NaN     NaN  16.238    NaN    NaN  259.8
At this point the columns are a MultiIndex. To flatten them:
# Convert integer years to strings
oldcols = res.columns
res.columns = oldcols.set_levels([oldcols.levels[0],
                                  oldcols.levels[1].astype(str)])
# Flatten columns
res.columns = ['_'.join(col) for col in res.columns.values]
res
                 AVGN_2016  AVGN_2017  AVGN_2018  PTSN_2016  PTSN_2017  PTSN_2018
Player
Aaron Donald           NaN     12.929     17.538        NaN      181.0      280.6
Danielle Hunter     12.788      8.325     14.850      204.6      133.2      237.6
J.J. Watt              NaN        NaN     16.238        NaN        NaN      259.8
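For the regression mentioned in the question, a layout with one row per consecutive-year pair is usually easier to work with than one column per year. As a rough sketch, and as a swapped-in technique rather than anything from the answer above, the pairs can be built with a self-merge on Year + 1 and a straight line fitted with `numpy.polyfit`. The names `prev` and `pairs` are illustrative, and `np.polyfit` is only one possible choice for the fit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Player': ['Aaron Donald', 'J.J. Watt', 'Danielle Hunter',
               'Aaron Donald', 'Danielle Hunter', 'Danielle Hunter'],
    'Year': [2018, 2018, 2018, 2017, 2017, 2016],
    'PTSN': [280.6, 259.8, 237.6, 181.0, 133.2, 204.6],
    'AVGN': [17.538, 16.238, 14.85, 12.929, 8.325, 12.788],
})

# Copy the table and move every row one year forward, so a player's
# 2017 row lines up with that player's 2018 row in the merge below
prev = df[['Player', 'Year', 'PTSN', 'AVGN']].copy()
prev['Year'] += 1

# Inner merge keeps only player-years that have a preceding season
pairs = df.merge(prev, on=['Player', 'Year'], suffixes=('', '_prev'))
print(pairs[['Player', 'Year', 'PTSN', 'PTSN_prev']])

# Least-squares line: PTSN ~ slope * PTSN_prev + intercept
slope, intercept = np.polyfit(pairs['PTSN_prev'], pairs['PTSN'], deg=1)
print(f'PTSN ~ {slope:.3f} * PTSN_prev + {intercept:.3f}')
```

Because the merge matches on the actual year, seasons separated by a gap are never paired, so no reindexing is needed. The sample data yields only three pairs, which is far too few for a meaningful fit; the sketch only shows the mechanics.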