Converting temp_ID rows into columns with pandas

Time: 2017-06-29 10:22:50

Tags: python pandas dataframe

My DataFrame looks like this:

PERIOD_START_TIME       ID    temp_ID  value1  value2
06.28.2017 22:00:00     88      1        4       2
06.28.2017 22:00:00     88      2        0       7
06.28.2017 22:00:00     89      2        0       9
06.28.2017 22:00:00     89      1        5       4
06.28.2017 22:00:00     90      1        12      13
06.28.2017 22:00:00     90      2        18      4

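Now I need to end up with half as many rows but twice as many value columns. In other words, each value column should be duplicated once per temp_ID, with the temp_ID made part of the column name. Simply put, temp_ID has to move from the rows into the columns.

Desired output: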

PERIOD_START_TIME    ID  value1_tpID1 value1_tpID2  value2_tpID1 value2_tpID2
06.28.2017 22:00:00  88          4       0            2            7
06.28.2017 22:00:00  89          5       0            4            9
06.28.2017 22:00:00  90          12      18           13           4

<class 'pandas.core.frame.DataFrame'>
Int64Index: 189604 entries, 0 to 10595
Data columns (total 12 columns):
PERIOD_START_TIME         189604 non-null object
ID                       189604 non-null int64
temp_ID                  189604 non-null int64
dtypes: float64(4), int64(6), object(2)
memory usage: 18.8+ MB

1 answer:

Answer 0: (score: 1)

You can use set_index together with unstack, then flatten the resulting MultiIndex columns:

# convert temp_ID to str if it is not one already, so that '_'.join works on the column tuples
df['temp_ID'] = df['temp_ID'].astype(str)
df = df.set_index(['PERIOD_START_TIME','ID','temp_ID']).unstack()
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
     PERIOD_START_TIME  ID  value1_1  value1_2  value2_1  value2_2
0  06.28.2017 22:00:00  88         4         0         2         7
1  06.28.2017 22:00:00  89         5         0         4         9
2  06.28.2017 22:00:00  90        12        18        13         4

Or, without converting temp_ID first:

df = df.set_index(['PERIOD_START_TIME','ID','temp_ID']).unstack()
df.columns = ['_'.join((x[0], str(x[1]))) for x in df.columns]
df = df.reset_index()
print (df)
     PERIOD_START_TIME  ID  value1_1  value1_2  value2_1  value2_2
0  06.28.2017 22:00:00  88         4         0         2         7
1  06.28.2017 22:00:00  89         5         0         4         9
2  06.28.2017 22:00:00  90        12        18        13         4
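
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the six sample rows from the question by hand (only the five columns shown there; the real frame has 12) and uses the `_tpID` infix from the desired output instead of a plain underscore. The variable name `wide` is only an illustrative choice.

```python
import pandas as pd

# Sample rows copied from the question (only the columns shown there).
df = pd.DataFrame({
    "PERIOD_START_TIME": ["06.28.2017 22:00:00"] * 6,
    "ID": [88, 88, 89, 89, 90, 90],
    "temp_ID": [1, 2, 2, 1, 1, 2],
    "value1": [4, 0, 0, 5, 12, 18],
    "value2": [2, 7, 9, 4, 13, 4],
})

# Move temp_ID from the row index into the columns.
wide = df.set_index(["PERIOD_START_TIME", "ID", "temp_ID"]).unstack()

# Each column label is now a (value name, temp_ID) tuple; flatten it.
# The "_tpID" infix follows the naming in the question's desired output.
wide.columns = [f"{value}_tpID{temp_id}" for value, temp_id in wide.columns]
wide = wide.reset_index()
print(wide)
```

Building the names with an f-string also avoids the `astype(str)` step, since the integer temp_ID is formatted directly.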

If the triple PERIOD_START_TIME, ID, temp_ID contains duplicates, you need pivot_table with an aggregate function such as mean or sum:

print (df)
     PERIOD_START_TIME  ID  temp_ID  value1  value2
0  06.28.2017 22:00:00  88        1       4       2 < same PERIOD_START_TIME  ID  temp_ID
1  06.28.2017 22:00:00  88        1       5       3 < same PERIOD_START_TIME  ID  temp_ID
2  06.28.2017 22:00:00  88        2       0       7
3  06.28.2017 22:00:00  89        2       0       9
4  06.28.2017 22:00:00  89        1       5       4
5  06.28.2017 22:00:00  90        1      12      13
6  06.28.2017 22:00:00  90        2      18       4

df = df.pivot_table(index=['PERIOD_START_TIME','ID'], 
                    columns='temp_ID', 
                    values=['value1','value2'],
                    aggfunc='mean')
df.columns = ['_'.join((x[0], str(x[1]))) for x in df.columns]
df = df.reset_index()
print (df)
     PERIOD_START_TIME  ID  value1_1  value1_2  value2_1  value2_2
0  06.28.2017 22:00:00  88       4.5       0.0       2.5       7.0
1  06.28.2017 22:00:00  89       5.0       0.0       4.0       9.0
2  06.28.2017 22:00:00  90      12.0      18.0      13.0       4.0
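
To see why the aggregation is needed, the sketch below rebuilds the seven rows with the duplicated key, checks for duplicates with `DataFrame.duplicated`, and shows that plain `set_index(...).unstack()` raises a `ValueError` on them. It then uses `aggfunc='sum'` instead of `'mean'` as one possible alternative. The names `keys`, `has_duplicates`, `unstack_failed` and `summed` are illustrative only.

```python
import pandas as pd

# Sample rows with one duplicated (PERIOD_START_TIME, ID, temp_ID) key.
df = pd.DataFrame({
    "PERIOD_START_TIME": ["06.28.2017 22:00:00"] * 7,
    "ID": [88, 88, 88, 89, 89, 90, 90],
    "temp_ID": [1, 1, 2, 2, 1, 1, 2],
    "value1": [4, 5, 0, 0, 5, 12, 18],
    "value2": [2, 3, 7, 9, 4, 13, 4],
})

keys = ["PERIOD_START_TIME", "ID", "temp_ID"]

# True if any key combination occurs more than once.
has_duplicates = bool(df.duplicated(subset=keys).any())
print(has_duplicates)

# unstack cannot reshape an index with duplicate entries.
try:
    df.set_index(keys).unstack()
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print(err)

# pivot_table resolves the duplicates by aggregating them, here with a sum.
summed = df.pivot_table(index=keys[:2],
                        columns="temp_ID",
                        values=["value1", "value2"],
                        aggfunc="sum")
summed.columns = [f"{value}_{temp_id}" for value, temp_id in summed.columns]
summed = summed.reset_index()
print(summed)
```

Whether mean, sum or another function is appropriate depends on what the duplicated rows represent in the real data.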

Alternative solution with groupby and unstack:

df = df.groupby(['PERIOD_START_TIME','ID','temp_ID']).mean().unstack()
df.columns = ['_'.join((x[0], str(x[1]))) for x in df.columns]
df = df.reset_index()
print (df)
     PERIOD_START_TIME  ID  value1_1  value1_2  value2_1  value2_2
0  06.28.2017 22:00:00  88       4.5       0.0       2.5       7.0
1  06.28.2017 22:00:00  89       5.0       0.0       4.0       9.0
2  06.28.2017 22:00:00  90      12.0      18.0      13.0       4.0
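
As a quick sanity check, the following sketch runs both approaches on the duplicated sample and compares the results. The helper `flatten` is a hypothetical convenience function, not part of pandas. The value columns are selected explicitly before `mean()`, on the assumption that the real frame also holds non-numeric columns (the `info()` output lists two object columns) that should not be averaged.

```python
import pandas as pd

# Same duplicated sample as above.
df = pd.DataFrame({
    "PERIOD_START_TIME": ["06.28.2017 22:00:00"] * 7,
    "ID": [88, 88, 88, 89, 89, 90, 90],
    "temp_ID": [1, 1, 2, 2, 1, 1, 2],
    "value1": [4, 5, 0, 0, 5, 12, 18],
    "value2": [2, 3, 7, 9, 4, 13, 4],
})

keys = ["PERIOD_START_TIME", "ID", "temp_ID"]
value_cols = ["value1", "value2"]


def flatten(frame):
    """Hypothetical helper: join (value, temp_ID) column tuples with '_'."""
    out = frame.copy()
    out.columns = [f"{value}_{temp_id}" for value, temp_id in out.columns]
    return out.reset_index()


via_pivot = flatten(df.pivot_table(index=keys[:2], columns="temp_ID",
                                   values=value_cols, aggfunc="mean"))
via_groupby = flatten(df.groupby(keys)[value_cols].mean().unstack())

# Compare the flattened value columns of both results.
flat_cols = [c for c in via_pivot.columns if c not in keys]
same_values = bool((via_pivot[flat_cols] == via_groupby[flat_cols]).all().all())
print(same_values)
```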