My DataFrame looks like this:
PERIOD_START_TIME ID temp_ID value1 value2
06.28.2017 22:00:00 88 1 4 2
06.28.2017 22:00:00 88 2 0 7
06.28.2017 22:00:00 89 2 0 9
06.28.2017 22:00:00 89 1 5 4
06.28.2017 22:00:00 90 1 12 13
06.28.2017 22:00:00 90 2 18 4
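Now I need to get rid of half of the rows but end up with twice as many value columns. In effect, each value column is duplicated and the temp_ID is appended to the column name. Simply put, temp_ID is converted from rows to columns.
Desired output:
PERIOD_START_TIME ID value1_tpID1 value1_tpID2 value2_tpID1 value2_tpID2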
06.28.2017 22:00:00 88 4 0 2 7
06.28.2017 22:00:00 89 5 0 4 9
06.28.2017 22:00:00 90 12 18 13 4
<class 'pandas.core.frame.DataFrame'>
Int64Index: 189604 entries, 0 to 10595
Data columns (total 12 columns):
PERIOD_START_TIME 189604 non-null object
ID 189604 non-null int64
temp_ID 189604 non-null int64
dtypes: float64(4), int64(6), object(2)
memory usage: 18.8+ MB
Answer 0 (score: 1)
# if temp_ID is not already a string, convert it so that '_'.join works on the column labels
df['temp_ID'] = df['temp_ID'].astype(str)
df = df.set_index(['PERIOD_START_TIME','ID','temp_ID']).unstack()
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
PERIOD_START_TIME ID value1_1 value1_2 value2_1 value2_2
0 06.28.2017 22:00:00 88 4 0 2 7
1 06.28.2017 22:00:00 89 5 0 4 9
2 06.28.2017 22:00:00 90 12 18 13 4
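The astype(str) step matters because, after unstack, each column label is a (value name, temp_ID) tuple, and '_'.join only accepts strings. A minimal sketch of what happens without the conversion, assuming the sample data from the question with an integer temp_ID (the names wide, first_label and flat are illustrative only):

```python
import pandas as pd

# Sample data from the question; temp_ID is deliberately left as an integer.
df = pd.DataFrame({
    'PERIOD_START_TIME': ['06.28.2017 22:00:00'] * 6,
    'ID': [88, 88, 89, 89, 90, 90],
    'temp_ID': [1, 2, 2, 1, 1, 2],
    'value1': [4, 0, 0, 5, 12, 18],
    'value2': [2, 7, 9, 4, 13, 4],
})

wide = df.set_index(['PERIOD_START_TIME', 'ID', 'temp_ID']).unstack()

# Each column label is a tuple such as ('value1', 1).
first_label = wide.columns[0]

# Joining a tuple that contains a non-string element raises TypeError.
try:
    '_'.join(first_label)
    join_error = None
except TypeError as exc:
    join_error = exc

# Converting every element to str first avoids the error.
flat = ['_'.join(map(str, label)) for label in wide.columns]
print(flat)
```

If temp_ID is already stored as strings, the conversion is unnecessary and the plain '_'.join works directly.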
Or, without converting temp_ID first:
df = df.set_index(['PERIOD_START_TIME','ID','temp_ID']).unstack()
df.columns = ['_'.join((x[0], str(x[1]))) for x in df.columns]
df = df.reset_index()
print (df)
PERIOD_START_TIME ID value1_1 value1_2 value2_1 value2_2
0 06.28.2017 22:00:00 88 4 0 2 7
1 06.28.2017 22:00:00 89 5 0 4 9
2 06.28.2017 22:00:00 90 12 18 13 4
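The question asked for names of the form value1_tpID1 rather than value1_1. One way to get there is to change only the string formatting in the list comprehension. This is a sketch assuming the sample data from the question; the "_tpID" infix is taken from the desired output, and wide is an illustrative name:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'PERIOD_START_TIME': ['06.28.2017 22:00:00'] * 6,
    'ID': [88, 88, 89, 89, 90, 90],
    'temp_ID': [1, 2, 2, 1, 1, 2],
    'value1': [4, 0, 0, 5, 12, 18],
    'value2': [2, 7, 9, 4, 13, 4],
})

wide = df.set_index(['PERIOD_START_TIME', 'ID', 'temp_ID']).unstack()

# Column labels are (value name, temp_ID) tuples; build "value1_tpID1" style names.
wide.columns = [f'{name}_tpID{tid}' for name, tid in wide.columns]
wide = wide.reset_index()
print(wide)
```

The f-string converts temp_ID to text implicitly, so no astype(str) is needed here.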
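If the triple PERIOD_START_TIME, ID, temp_ID contains duplicates, the duplicated rows have to be combined with an aggregate function such as mean or sum, for example via pivot_table: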
print (df)
PERIOD_START_TIME ID temp_ID value1 value2
0 06.28.2017 22:00:00 88 1 4 2 < same PERIOD_START_TIME ID temp_ID
1 06.28.2017 22:00:00 88 1 5 3 < same PERIOD_START_TIME ID temp_ID
2 06.28.2017 22:00:00 88 2 0 7
3 06.28.2017 22:00:00 89 2 0 9
4 06.28.2017 22:00:00 89 1 5 4
5 06.28.2017 22:00:00 90 1 12 13
6 06.28.2017 22:00:00 90 2 18 4
df = df.pivot_table(index=['PERIOD_START_TIME','ID'],
                    columns='temp_ID',
                    values=['value1','value2'],
                    aggfunc='mean')
df.columns = ['_'.join((x[0], str(x[1]))) for x in df.columns]
df = df.reset_index()
print (df)
PERIOD_START_TIME ID value1_1 value1_2 value2_1 value2_2
0 06.28.2017 22:00:00 88 4.5 0.0 2.5 7.0
1 06.28.2017 22:00:00 89 5.0 0.0 4.0 9.0
2 06.28.2017 22:00:00 90 12.0 18.0 13.0 4.0
Alternative solution:
df = df.groupby(['PERIOD_START_TIME','ID','temp_ID']).mean().unstack()
df.columns = ['_'.join((x[0], str(x[1]))) for x in df.columns]
df = df.reset_index()
print (df)
PERIOD_START_TIME ID value1_1 value1_2 value2_1 value2_2
0 06.28.2017 22:00:00 88 4.5 0.0 2.5 7.0
1 06.28.2017 22:00:00 89 5.0 0.0 4.0 9.0
2 06.28.2017 22:00:00 90 12.0 18.0 13.0 4.0
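To decide which approach applies, it may help to check for duplicated key triples before reshaping. The sketch below assumes the duplicated sample data shown above. It shows that a plain unstack fails on duplicates, then aggregates with sum instead of mean (the names keys, has_duplicates, unstack_error and wide are illustrative only):

```python
import pandas as pd

# Sample data with one duplicated (PERIOD_START_TIME, ID, temp_ID) triple.
df = pd.DataFrame({
    'PERIOD_START_TIME': ['06.28.2017 22:00:00'] * 7,
    'ID': [88, 88, 88, 89, 89, 90, 90],
    'temp_ID': [1, 1, 2, 2, 1, 1, 2],
    'value1': [4, 5, 0, 0, 5, 12, 18],
    'value2': [2, 3, 7, 9, 4, 13, 4],
})

keys = ['PERIOD_START_TIME', 'ID', 'temp_ID']

# True if any key triple occurs more than once.
has_duplicates = bool(df.duplicated(subset=keys).any())
print(has_duplicates)

# A plain unstack cannot reshape an index with duplicate entries.
try:
    df.set_index(keys).unstack()
    unstack_error = None
except ValueError as exc:
    unstack_error = exc

# Aggregating with sum combines the duplicated rows instead.
wide = df.pivot_table(index=['PERIOD_START_TIME', 'ID'],
                      columns='temp_ID',
                      values=['value1', 'value2'],
                      aggfunc='sum')
wide.columns = [f'{name}_{tid}' for name, tid in wide.columns]
wide = wide.reset_index()
print(wide)
```

Whether mean, sum, or another function is appropriate depends on what the duplicated rows represent in the real data.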