I have some pandas DataFrames with several columns, and I want to rearrange them in a different way. An example is below:
time,name,feature,value
33 20 May 2016 14:00:00 -0700,John,badL,2
45 19 May 2016 18:00:00 -0700,John,badL,1
120 17 May 2016 11:00:00 -0700,John,badL,1
220 20 May 2016 14:00:00 -0700,John,totalL,20
450 19 May 2016 18:00:00 -0700,John,totalL,15
330 18 May 2016 15:00:00 -0700,Mary,badL,2
330 18 May 2016 15:00:00 -0700,Mary,totalL,20
550 21 May 2016 12:00:00 -0700,Mary,adCmd,4
700 22 May 2016 16:00:00 -0700,Mary,PC,3
800 22 May 2016 16:00:00 -0700,Mary,eCon,200
Note: the first column value (time) is preceded by the index value (33, 45, 120, ...).
From the DataFrame above, I would like to get this result:
time,name,badL,totalL,adCmd,PC,eCon
20 May 2016 14:00:00 -0700,John,2,20,0,0,0
19 May 2016 18:00:00 -0700,John,1,15,0,0,0
17 May 2016 11:00:00 -0700,John,1,0,0,0,0
18 May 2016 15:00:00 -0700,Mary,2,20,0,0,0
21 May 2016 12:00:00 -0700,Mary,0,0,4,0,0
22 May 2016 16:00:00 -0700,Mary,0,0,0,3,200
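Note: John has no totalL entry for the 17th, so it should be filled with 0.
Is there an elegant way to do this? I converted the time field with pd.to_datetime and then compared values, which seems tedious. In the example above I have only two 'features' (badL, totalL). I will have several more later.
This is what I have so far, but it adds a separate row for the second feature (totalL) instead of putting it in the same row.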
for f in ['badL', 'totalL']:
    dff = df[df.feature == f]
    print(dff)
    if len(dff.index) > 0:
        fullFeatureDf[f] = dff.feature_value
Answer 0 (score: 2)
from io import StringIO  # Python 3; the Python 2 StringIO module no longer exists
import pandas as pd
text = '''time,name,f1,value
20 May 2016 14:00:00 -0700,John,badL,2
19 May 2016 18:00:00 -0700,John,badL,1
17 May 2016 11:00:00 -0700,John,badL,1
20 May 2016 14:00:00 -0700,John,totalL,20
19 May 2016 18:00:00 -0700,John,totalL,15
17 May 2016 11:00:00 -0700,John,totalL,12
'''
df = pd.read_csv(StringIO(text))
print(df)
                         time  name      f1  value
0  20 May 2016 14:00:00 -0700  John    badL      2
1  19 May 2016 18:00:00 -0700  John    badL      1
2  17 May 2016 11:00:00 -0700  John    badL      1
3  20 May 2016 14:00:00 -0700  John  totalL     20
4  19 May 2016 18:00:00 -0700  John  totalL     15
5  17 May 2016 11:00:00 -0700  John  totalL     12
Use unstack. First, move time, name, and f1 into the index with set_index:
df = df.set_index(['time', 'name', 'f1'])
print(df)
                                        value
time                       name f1
20 May 2016 14:00:00 -0700 John badL        2
19 May 2016 18:00:00 -0700 John badL        1
17 May 2016 11:00:00 -0700 John badL        1
20 May 2016 14:00:00 -0700 John totalL     20
19 May 2016 18:00:00 -0700 John totalL     15
17 May 2016 11:00:00 -0700 John totalL     12
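Then unstack to perform the pivot. It takes part of the row index (by default the innermost level, here f1) and moves it into the columns.
print(df.unstack())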
                                value
f1                               badL totalL
time                       name
17 May 2016 11:00:00 -0700 John     1     12
19 May 2016 18:00:00 -0700 John     1     15
20 May 2016 14:00:00 -0700 John     2     20
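In spirit, this is exactly the same solution as Yakym Pirozhenko's; it just gets there in a slightly different way. It feels more intuitive to me, but it may not to you.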
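For the question's full sample, where some (time, name) pairs have no entry for a given feature, the same set_index/unstack idea can fill the gaps with 0. The following is only a sketch under two assumptions that are not part of the answer above: the leading index numbers have already been removed from the time column, and the column is named feature as in the question. The variable name wide is just for illustration.

```python
from io import StringIO

import pandas as pd

# Sample rows from the question, with the leading index numbers removed (assumption).
text = '''time,name,feature,value
20 May 2016 14:00:00 -0700,John,badL,2
19 May 2016 18:00:00 -0700,John,badL,1
17 May 2016 11:00:00 -0700,John,badL,1
20 May 2016 14:00:00 -0700,John,totalL,20
19 May 2016 18:00:00 -0700,John,totalL,15
18 May 2016 15:00:00 -0700,Mary,badL,2
18 May 2016 15:00:00 -0700,Mary,totalL,20
21 May 2016 12:00:00 -0700,Mary,adCmd,4
22 May 2016 16:00:00 -0700,Mary,PC,3
22 May 2016 16:00:00 -0700,Mary,eCon,200
'''
df = pd.read_csv(StringIO(text))

# Selecting the 'value' Series before unstacking avoids a two-level column header.
wide = (
    df.set_index(['time', 'name', 'feature'])['value']
      .unstack(fill_value=0)  # missing (time, name, feature) combinations become 0
      .reset_index()          # turn time and name back into ordinary columns
)
wide.columns.name = None  # drop the leftover 'feature' label on the column axis
print(wide)
```

Note that unstack requires each (time, name, feature) combination to appear at most once; duplicates would need to be aggregated first.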
Answer 1 (score: 1)
This is a job for df.pivot:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(
'''
time,name,feature,value
33 20 May 2016 14:00:00 -0700,John,badL,2
45 19 May 2016 18:00:00 -0700,John,badL,1
120 17 May 2016 11:00:00 -0700,John,badL,1
220 20 May 2016 14:00:00 -0700,John,totalL,20
450 19 May 2016 18:00:00 -0700,John,totalL,15
330 18 May 2016 15:00:00 -0700,Mary,badL,2
330 18 May 2016 15:00:00 -0700,Mary,totalL,20
550 21 May 2016 12:00:00 -0700,Mary,adCmd,4
700 22 May 2016 16:00:00 -0700,Mary,PC,3
800 22 May 2016 16:00:00 -0700,Mary,eCon,200
'''), sep=',').set_index(['time', 'name'])
df_new = df.pivot(columns='feature').fillna(0).astype(int)
#                                      value
# feature                              PC  adCmd  badL  eCon  totalL
# time                           name
# 120 17 May 2016 11:00:00 -0700 John   0      0     1     0       0
# 220 20 May 2016 14:00:00 -0700 John   0      0     0     0      20
# 33 20 May 2016 14:00:00 -0700  John   0      0     2     0       0
# 330 18 May 2016 15:00:00 -0700 Mary   0      0     2     0      20
# 45 19 May 2016 18:00:00 -0700  John   0      0     1     0       0
# 450 19 May 2016 18:00:00 -0700 John   0      0     0     0      15
# 550 21 May 2016 12:00:00 -0700 Mary   0      4     0     0       0
# 700 22 May 2016 16:00:00 -0700 Mary   3      0     0     0       0
# 800 22 May 2016 16:00:00 -0700 Mary   0      0     0   200       0
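In the output above, the pasted index numbers are still part of each time string, so rows that share a timestamp (for example John's badL and totalL on 20 May) stay on separate lines. The sketch below assumes those numbers are copy-paste residue and strips them before pivoting. It uses pivot_table rather than pivot so that fill_value can be passed directly; the regular expression, the choice of aggfunc='sum', and the name wide are illustrative assumptions, not part of the answer.

```python
from io import StringIO

import pandas as pd

text = '''time,name,feature,value
33 20 May 2016 14:00:00 -0700,John,badL,2
45 19 May 2016 18:00:00 -0700,John,badL,1
120 17 May 2016 11:00:00 -0700,John,badL,1
220 20 May 2016 14:00:00 -0700,John,totalL,20
450 19 May 2016 18:00:00 -0700,John,totalL,15
330 18 May 2016 15:00:00 -0700,Mary,badL,2
330 18 May 2016 15:00:00 -0700,Mary,totalL,20
550 21 May 2016 12:00:00 -0700,Mary,adCmd,4
700 22 May 2016 16:00:00 -0700,Mary,PC,3
800 22 May 2016 16:00:00 -0700,Mary,eCon,200
'''
df = pd.read_csv(StringIO(text))

# Assumption: every time value starts with a pasted index number followed by a space.
# Do not apply this to clean timestamps, or the day of the month would be stripped.
df['time'] = df['time'].str.replace(r'^\d+\s+', '', regex=True)
# Parse to timezone-aware datetimes so the rows sort chronologically.
df['time'] = pd.to_datetime(df['time'], format='%d %b %Y %H:%M:%S %z')

wide = (
    df.pivot_table(index=['time', 'name'], columns='feature',
                   values='value', aggfunc='sum', fill_value=0)
      .reindex(columns=['badL', 'totalL', 'adCmd', 'PC', 'eCon'])  # order from the question
      .reset_index()
)
wide.columns.name = None  # drop any leftover label on the column axis
print(wide)
```

With aggfunc='sum', repeated (time, name, feature) rows would be added together; a different aggregation may be more appropriate depending on what the values mean.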