Unstacking a pandas DataFrame with multiple types of observations per variable

Time: 2018-08-15 09:03:28

Tags: python pandas

Given a stacked DataFrame like this, where each variable has three types of observation:

     ID      Variable  Value
0  1056     Run Score     89
1  1056      Run Rank     56
2  1056    Run Decile      8
3  1056    Swim Score     92
4  1056     Swim Rank     64
5  1056   Swim Decile      8
6  1056   Cycle Score     96
7  1056    Cycle Rank     32
8  1056  Cycle Decile      9

How can I unstack it into this:

Variable    ID  Decile  Rank  Score  Event
0         1056       8    56     89    Run
0         1056       8    64     92   Swim
0         1056       9    32     96  Cycle

This is how I currently do it, but it feels overly complicated:

import pandas as pd

data = [(1056, "Run Score", 89),
    (1056, "Run Rank", 56),
    (1056, "Run Decile", 8),
    (1056, "Swim Score", 92),
    (1056, "Swim Rank", 64),
    (1056, "Swim Decile", 8),
    (1056, "Cycle Score", 96),
    (1056, "Cycle Rank", 32),
    (1056, "Cycle Decile", 9)]

cols = ["ID", "Variable", "Value"]

all_data = pd.DataFrame(data=data, columns=cols)

event_names = ["Run", "Swim", "Cycle"]

event_data_all = []

for event_name in event_names:
    event_data = all_data.loc[all_data["Variable"].str.startswith(event_name)]
    # aggfunc="sum" replaces pd.np.sum; the pd.np alias is no longer available in current pandas
    event_data = event_data.pivot_table(index="ID", columns="Variable", values="Value", aggfunc="sum")
    event_data.reset_index(inplace=True)
    event_data.rename(columns={
        event_name + " Score": "Score",
        event_name + " Rank": "Rank",
        event_name + " Decile": "Decile"
    }, inplace=True)
    event_data["Event"] = event_name
    event_data_all.append(event_data)

all_data_final = pd.concat(event_data_all)

Is there a better way?

1 Answer:

Answer 0 (score: 3)

The idea is to create 2 new columns with split and use them for pivoting:

# .copy() avoids a SettingWithCopyWarning when the new columns are assigned
all_data = all_data.loc[all_data["Variable"].str.startswith(tuple(event_names))].copy()
all_data[['Event','b']] = all_data['Variable'].str.split(expand=True)

df = all_data.set_index(['ID','Event','b'])['Value'].unstack().reset_index().rename_axis(None, axis=1)
print(df)
     ID  Event  Decile  Rank  Score
0  1056  Cycle       9    32     96
1  1056    Run       8    56     89
2  1056   Swim       8    64     92
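
To make the split step easier to follow, here is a self-contained sketch that rebuilds the sample data from the question and prints the intermediate frame before unstacking. The column name `Measure` and the variable names `parts`, `tidy` and `wide` are only illustrative stand-ins (the answer above uses `b`), so treat this as one possible layout rather than the only way to write it.

```python
import pandas as pd

# Same sample data as in the question.
data = [
    (1056, "Run Score", 89), (1056, "Run Rank", 56), (1056, "Run Decile", 8),
    (1056, "Swim Score", 92), (1056, "Swim Rank", 64), (1056, "Swim Decile", 8),
    (1056, "Cycle Score", 96), (1056, "Cycle Rank", 32), (1056, "Cycle Decile", 9),
]
all_data = pd.DataFrame(data, columns=["ID", "Variable", "Value"])

# Split "Run Score" on whitespace into an event part and a measure part.
# "Measure" is a hypothetical column name used here instead of 'b'.
parts = all_data["Variable"].str.split(expand=True)
parts.columns = ["Event", "Measure"]

# Intermediate long frame: one row per (ID, Event, Measure).
tidy = pd.concat([all_data[["ID", "Value"]], parts], axis=1)
print(tidy)

# Move the Measure level into columns, then flatten the index again.
wide = (
    tidy.set_index(["ID", "Event", "Measure"])["Value"]
    .unstack("Measure")
    .reset_index()
    .rename_axis(None, axis=1)
)
print(wide)
```

This works as long as every value in `Variable` consists of exactly two whitespace-separated words; otherwise the split returns a different number of columns.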

Thanks to @asongtoruin for another solution, which is especially useful if the data needs to be aggregated:

df = (all_data.pivot_table(index=['ID', 'Event'],
                           columns='b',
                           values='Value',
                           aggfunc='sum').reset_index().rename_axis(None, axis=1))
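
As a rough illustration of why aggregation matters: if the same ID, event and measure occur more than once, unstack cannot reshape the data, while pivot_table combines the duplicates with the chosen aggfunc. The duplicated "Run Score" rows below are made-up data, and the names `dup`, `agg` and `unstack_failed` are hypothetical.

```python
import pandas as pd

# Hypothetical data: the Run Score is recorded twice for the same ID.
dup = pd.DataFrame(
    {
        "ID": [1056, 1056, 1056],
        "Event": ["Run", "Run", "Run"],
        "b": ["Score", "Score", "Rank"],
        "Value": [89, 91, 56],
    }
)

# unstack raises ValueError because the index contains duplicate entries.
try:
    dup.set_index(["ID", "Event", "b"])["Value"].unstack()
    unstack_failed = False
except ValueError:
    unstack_failed = True
print(unstack_failed)

# pivot_table aggregates the duplicates instead (here by summing them).
agg = (
    dup.pivot_table(index=["ID", "Event"], columns="b", values="Value", aggfunc="sum")
    .reset_index()
    .rename_axis(None, axis=1)
)
print(agg)
```

Whether 'sum' is the right aggregation depends on the data; 'mean', 'max' or 'first' may fit better for scores and ranks.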

Another solution is to use str.extract with the event_names:

event_names = ["Run", "Swim", "Cycle"]
all_data = all_data.loc[all_data["Variable"].str.startswith(tuple(event_names))].copy()
# raw string, so that \s is not treated as an invalid string escape
pat = '(' + '|'.join(event_names) + r')\s+(.*)'
all_data[['Event','b']] = all_data['Variable'].str.extract(pat)

df = (all_data.pivot_table(index=['ID', 'Event'], 
                          columns='b', 
                          values='Value', 
                          aggfunc='sum').reset_index().rename_axis(None, axis=1))
print(df)

     ID  Event  Decile  Rank  Score
0  1056  Cycle       9    32     96
1  1056    Run       8    56     89
2  1056   Swim       8    64     92
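
One case where the extract variant may be preferable is an event name that itself contains a space, because a plain whitespace split then yields more than two columns. The sketch below assumes a made-up event called "Open Swim". Passing the names through `re.escape` is an extra precaution I added against regex metacharacters in the names; it is not part of the solution above.

```python
import re
import pandas as pd

# Hypothetical event names; "Open Swim" contains a space.
event_names = ["Run", "Open Swim"]
df_long = pd.DataFrame(
    {
        "ID": [1056] * 4,
        "Variable": ["Run Score", "Run Rank", "Open Swim Score", "Open Swim Rank"],
        "Value": [89, 56, 92, 64],
    }
)

# A plain whitespace split produces three columns for this data.
n_split_cols = df_long["Variable"].str.split(expand=True).shape[1]
print(n_split_cols)

# Build the pattern from the known event names: group 1 is the event,
# group 2 is everything after the separating whitespace.
pat = "(" + "|".join(map(re.escape, event_names)) + r")\s+(.*)"
extracted = df_long["Variable"].str.extract(pat)
extracted.columns = ["Event", "b"]

wide = (
    pd.concat([df_long, extracted], axis=1)
    .pivot_table(index=["ID", "Event"], columns="b", values="Value", aggfunc="sum")
    .reset_index()
    .rename_axis(None, axis=1)
)
print(wide)
```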