I have the following DataFrame (pandas version 0.13.1):
>>> import pandas as pd
>>> DF = pd.DataFrame({'Group':['G1','G1','G2','G2'],'Start':['10','10','12','13'],'End':['13','13','14','15'],'Sample':['S1','S2','S3','S3'],'Status':['yes','yes','no','yes'],'pValue':[0.13,0.12,0.96,0.76],'pValueString':['13/100','12/100','96/100','76/100'],'desc':['aaaaaa','bbbbbb','aaaaaa','cccccc']})
>>> DF
  End Group Sample Start Status  pValue pValueString    desc
0  13    G1     S1    10    yes    0.13       13/100  aaaaaa
1  13    G1     S2    10    yes    0.12       12/100  bbbbbb
2  14    G2     S3    12     no    0.96       96/100  aaaaaa
3  15    G2     S3    13    yes    0.76       76/100  cccccc

[4 rows x 8 columns]
I ultimately need to reshape the DataFrame above into the following format:
Group Start End   Sample            Status  desc
                  S1      S2
G1    10    13    13/100  12/100    yes     aaaaaa
                  S3
G2    12    14    96/100            no      aaaaaa
      13    15    76/100            yes     cccccc
I have tried using pivot_table and groupby, but to no avail. Any help would be greatly appreciated.
This is what I have so far:
grouped = DF.groupby('Group')
for g, v in grouped:
    # rows/cols are the pivot_table keyword names in pandas 0.13
    pd.pivot_table(data=v, values=['pValue', 'pValueString'], rows=['Group', 'Start', 'End'], cols=['Sample'])['pValueString']
How can I get the corresponding desc and Status?
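Answer 0 (score: 2)
First, find the desc and Status values, taken from the row with the highest pValue in each (Group, Start, End) combination: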
groups = DF.groupby(['Group','Start','End'])
maxvals = groups.apply(lambda x: x.sort('pValue', ascending = False).head(1))
maxvals = maxvals[['Status','desc']].reset_index()
maxvals
Out[69]:
  Group Start End  level_3 Status    desc
0    G1    10  13        0    yes  aaaaaa
1    G2    12  14        2     no  aaaaaa
2    G2    13  15        3    yes  cccccc
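On newer pandas releases, where DataFrame.sort is no longer available, the same selection can be sketched with sort_values and drop_duplicates. This is a minimal sketch rather than the answer's own code: the names keys and maxvals_modern are illustrative, and the result does not carry the level_3 helper column.

```python
import pandas as pd

DF = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G2'],
    'Start': ['10', '10', '12', '13'],
    'End': ['13', '13', '14', '15'],
    'Sample': ['S1', 'S2', 'S3', 'S3'],
    'Status': ['yes', 'yes', 'no', 'yes'],
    'pValue': [0.13, 0.12, 0.96, 0.76],
    'pValueString': ['13/100', '12/100', '96/100', '76/100'],
    'desc': ['aaaaaa', 'bbbbbb', 'aaaaaa', 'cccccc'],
})

keys = ['Group', 'Start', 'End']  # illustrative name for the grouping columns

# Put the largest pValue first, keep the first row seen for each key
# combination, then restore key order and a clean 0..n-1 index.
maxvals_modern = (
    DF.sort_values('pValue', ascending=False)
      .drop_duplicates(subset=keys)
      .sort_values(keys)[keys + ['Status', 'desc']]
      .reset_index(drop=True)
)
print(maxvals_modern)
```

Because only the key columns plus Status and desc are kept, a later merge would pick up exactly the shared key columns.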
Build the pivot table separately:
pvt = DF.pivot_table(index =['Group','Start','End'],
columns = 'Sample',
values = 'pValueString',
aggfunc = max).reset_index()
pvt
Out[70]:
Sample Group Start End      S1      S2      S3
0         G1    10  13  13/100  12/100     NaN
1         G2    12  14     NaN     NaN  96/100
2         G2    13  15     NaN     NaN  76/100
Finally, merge the two on their shared columns (Group, Start, End):
pd.merge(pvt, maxvals)
Out[73]:
Sample Group Start End      S1      S2      S3  level_3 Status    desc
0         G1    10  13  13/100  12/100     NaN        0    yes  aaaaaa
1         G2    12  14     NaN     NaN  96/100        2     no  aaaaaa
2         G2    13  15     NaN     NaN  76/100        3    yes  cccccc
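For reference, the whole pivot-then-merge flow might look like this on a recent pandas version. It is a sketch under assumptions, not the answer's exact code: keys, best, and result are illustrative names, the merge keys are spelled out with on=, and the best row per group is picked with sort_values followed by groupby(...).first().

```python
import pandas as pd

DF = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G2'],
    'Start': ['10', '10', '12', '13'],
    'End': ['13', '13', '14', '15'],
    'Sample': ['S1', 'S2', 'S3', 'S3'],
    'Status': ['yes', 'yes', 'no', 'yes'],
    'pValue': [0.13, 0.12, 0.96, 0.76],
    'pValueString': ['13/100', '12/100', '96/100', '76/100'],
    'desc': ['aaaaaa', 'bbbbbb', 'aaaaaa', 'cccccc'],
})

keys = ['Group', 'Start', 'End']  # illustrative name for the grouping columns

# Status/desc of the highest-pValue row in each key combination.
best = (
    DF.sort_values('pValue', ascending=False)
      .groupby(keys, as_index=False)[['Status', 'desc']]
      .first()
)

# One column per Sample, holding that sample's pValueString.
pvt = DF.pivot_table(index=keys, columns='Sample',
                     values='pValueString', aggfunc='max').reset_index()

# Merge on the key columns only, so no helper column is carried along.
result = pd.merge(pvt, best, on=keys)
print(result)
```

Naming the merge keys explicitly avoids relying on whichever columns the two frames happen to share.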
Answer 1 (score: 0)
Create an index of the rows with the highest pValue in each group (across samples, for a given Start and End within a Group):
idx = DF.groupby(['Group', 'Start', 'End']).pValue.agg(lambda x: x.idxmax())
Use this index to get the Status and desc:
# .loc is used here instead of .ix, which newer pandas versions no longer provide
a = DF.loc[idx][['Status', 'desc']]
>>> a
  Status    desc
0    yes  aaaaaa
2     no  aaaaaa
3    yes  cccccc
Then get the maximum pValue for each group/sample (in pivot-table form):
b = DF.groupby(['Group', 'Start', 'End', 'Sample']).pValue.max().unstack()
>>> b
Sample             S1    S2    S3
Group Start End
G1    10    13   0.13  0.12   NaN
G2    12    14    NaN   NaN  0.96
      13    15    NaN   NaN  0.76
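Finally, set the index of the first DataFrame (a) to the index of the new one (b) and join them. This works because both are ordered by the same (Group, Start, End) keys: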
a.index = b.index
df_new = b.join(a)
>>> df_new
                   S1    S2    S3 Status    desc
Group Start End
G1    10    13   0.13  0.12   NaN    yes  aaaaaa
G2    12    14    NaN   NaN  0.96     no  aaaaaa
      13    15    NaN   NaN  0.76    yes  cccccc
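A variant of this approach that joins on the key columns themselves, instead of overwriting one index with another, could be sketched as follows. The names keys, b_str, and df_new_str are illustrative. The second half is an assumption about what the question wants (pValueString in the cells rather than the numeric pValue) and only works when each (Group, Start, End, Sample) combination occurs once.

```python
import pandas as pd

DF = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G2'],
    'Start': ['10', '10', '12', '13'],
    'End': ['13', '13', '14', '15'],
    'Sample': ['S1', 'S2', 'S3', 'S3'],
    'Status': ['yes', 'yes', 'no', 'yes'],
    'pValue': [0.13, 0.12, 0.96, 0.76],
    'pValueString': ['13/100', '12/100', '96/100', '76/100'],
    'desc': ['aaaaaa', 'bbbbbb', 'aaaaaa', 'cccccc'],
})

keys = ['Group', 'Start', 'End']  # illustrative name for the grouping columns

# Row labels of the highest pValue per key combination.
idx = DF.groupby(keys)['pValue'].idxmax()

# Status/desc for those rows, indexed by the keys so the join aligns on them.
a = DF.loc[idx.to_numpy(), keys + ['Status', 'desc']].set_index(keys)

# Maximum pValue per key combination and Sample, one column per Sample.
b = DF.groupby(keys + ['Sample'])['pValue'].max().unstack()
df_new = b.join(a)

# Same shape with the string form of the p-value in the cells; unstack
# requires each (keys, Sample) combination to be unique.
b_str = DF.set_index(keys + ['Sample'])['pValueString'].unstack()
df_new_str = b_str.join(a)
print(df_new_str)
```

Joining on a shared MultiIndex means the result no longer depends on the two frames happening to be in the same row order.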