Given a stacked DataFrame like this, where each variable has three types of observation:
     ID      Variable  Value
0  1056     Run Score     89
1  1056      Run Rank     56
2  1056    Run Decile      8
3  1056    Swim Score     92
4  1056     Swim Rank     64
5  1056   Swim Decile      8
6  1056   Cycle Score     96
7  1056    Cycle Rank     32
8  1056  Cycle Decile      9
How can I unstack it into something like this:
Variable    ID  Decile  Rank  Score  Event
0         1056       8    56     89    Run
0         1056       8    64     92   Swim
0         1056       9    32     96  Cycle
This is how I currently do it, but it feels overly complicated:
import pandas as pd

data = [(1056, "Run Score", 89),
        (1056, "Run Rank", 56),
        (1056, "Run Decile", 8),
        (1056, "Swim Score", 92),
        (1056, "Swim Rank", 64),
        (1056, "Swim Decile", 8),
        (1056, "Cycle Score", 96),
        (1056, "Cycle Rank", 32),
        (1056, "Cycle Decile", 9)]
cols = ["ID", "Variable", "Value"]
all_data = pd.DataFrame(data=data, columns=cols)

event_names = ["Run", "Swim", "Cycle"]
event_data_all = []
for event_name in event_names:
    # Keep only the rows belonging to this event, then pivot them wide
    event_data = all_data.loc[all_data["Variable"].str.startswith(event_name)]
    # aggfunc="sum" replaces pd.np.sum; the pd.np alias was removed in pandas 2.0
    event_data = event_data.pivot_table(index="ID", columns="Variable", values="Value", aggfunc="sum")
    event_data.reset_index(inplace=True)
    # Strip the event prefix from the column names
    event_data.rename(columns={
        event_name + " Score": "Score",
        event_name + " Rank": "Rank",
        event_name + " Decile": "Decile"
    }, inplace=True)
    event_data["Event"] = event_name
    event_data_all.append(event_data)

all_data_final = pd.concat(event_data_all)
Is there a better way?
Answer 0 (score: 3)
The idea is to create two new columns with split and use them for pivoting:
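# .copy() avoids a SettingWithCopyWarning when the new columns are assigned
all_data = all_data.loc[all_data["Variable"].str.startswith(tuple(event_names))].copy()
# "Run Score" -> Event="Run", b="Score"
all_data[['Event', 'b']] = all_data['Variable'].str.split(expand=True)
# The axis must be passed by keyword: rename_axis(None, 1) fails on pandas 2.0+
df = (all_data.set_index(['ID', 'Event', 'b'])['Value']
              .unstack()
              .reset_index()
              .rename_axis(None, axis=1))
print(df)
     ID  Event  Decile  Rank  Score
0  1056  Cycle       9    32     96
1  1056    Run       8    56     89
2  1056   Swim       8    64     92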
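For anyone who wants to run the split-and-unstack idea without the rest of the question's code, here is a minimal self-contained sketch. The column name `Measure` and the variable names `tidy` and `wide` are my own choices for illustration, and `n=1` is an assumption that only the first space separates the event from the measure.

```python
import pandas as pd

data = [(1056, "Run Score", 89), (1056, "Run Rank", 56), (1056, "Run Decile", 8),
        (1056, "Swim Score", 92), (1056, "Swim Rank", 64), (1056, "Swim Decile", 8),
        (1056, "Cycle Score", 96), (1056, "Cycle Rank", 32), (1056, "Cycle Decile", 9)]
all_data = pd.DataFrame(data, columns=["ID", "Variable", "Value"])

# Split on the first whitespace only: "Run Score" -> ("Run", "Score")
parts = all_data["Variable"].str.split(n=1, expand=True)
tidy = all_data.assign(Event=parts[0], Measure=parts[1])

# Move the measure level from the row index into the columns
wide = (tidy.set_index(["ID", "Event", "Measure"])["Value"]
            .unstack()
            .reset_index()
            .rename_axis(None, axis=1))
print(wide)
```

Using `assign` here leaves `all_data` untouched, so nothing is written into a filtered slice.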
Thanks to @asongtoruin for another solution, which is especially useful if the data need to be aggregated:
df = (all_data.pivot_table(index=['ID', 'Event'],
                           columns='b',
                           values='Value',
                           aggfunc='sum').reset_index().rename_axis(None, axis=1))
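To illustrate why aggregation can matter, here is a small sketch with made-up data in which one ID/event/measure combination appears twice. The duplicated rows are hypothetical and not part of the question's data. As far as I know, unstack needs a unique index and raises a ValueError in this case, while pivot_table combines the duplicates with the chosen aggfunc.

```python
import pandas as pd

# Hypothetical data: athlete 1056 has two "Run Score" rows
rows = [(1056, "Run", "Score", 89),
        (1056, "Run", "Score", 91),
        (1056, "Run", "Rank", 56)]
df = pd.DataFrame(rows, columns=["ID", "Event", "b", "Value"])

# unstack() cannot reshape when the index contains duplicate entries
try:
    df.set_index(["ID", "Event", "b"])["Value"].unstack()
    unstack_failed = False
except ValueError:
    unstack_failed = True

# pivot_table() aggregates the duplicated rows instead of failing
wide = (df.pivot_table(index=["ID", "Event"], columns="b",
                       values="Value", aggfunc="sum")
          .reset_index()
          .rename_axis(None, axis=1))
print(unstack_failed)
print(wide)
```

Whether summing is the right aggregation depends on the data; "mean" or "max" may be more appropriate for scores.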
Another solution is to use str.extract with a pattern built from event_names:
event_names = ["Run", "Swim", "Cycle"]
all_data = all_data.loc[all_data["Variable"].str.startswith(tuple(event_names))].copy()
# Raw strings keep \s from being treated as an invalid string escape
pat = r'(' + '|'.join(event_names) + r')\s+(.*)'
all_data[['Event', 'b']] = all_data['Variable'].str.extract(pat)
df = (all_data.pivot_table(index=['ID', 'Event'],
                           columns='b',
                           values='Value',
                           aggfunc='sum').reset_index().rename_axis(None, axis=1))
print(df)
     ID  Event  Decile  Rank  Score
0  1056  Cycle       9    32     96
1  1056    Run       8    56     89
2  1056   Swim       8    64     92
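One case where extracting by event name may help over a plain whitespace split is an event name that itself contains a space. The sketch below uses hypothetical event names to show this. Wrapping each name in re.escape is my own precaution in case a name contains regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical event names: one of them contains spaces
event_names = ["Run", "Open Water Swim"]
s = pd.Series(["Run Score", "Open Water Swim Score", "Open Water Swim Rank"])

# Build "(Run|Open\ Water\ Swim)\s+(.*)"; re.escape protects special characters
pat = r"(" + "|".join(re.escape(name) for name in event_names) + r")\s+(.*)"

# Group 1 becomes the event, group 2 the rest of the string
extracted = s.str.extract(pat)
extracted.columns = ["Event", "b"]
print(extracted)
```

Regex alternation tries the names from left to right, so if one name is a prefix of another (say "Swim" and "Swim Relay"), listing the longer name first should avoid a partial match.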