Getting a pandas pivot table

Date: 2015-06-18 03:19:55

Tags: python pandas

I have the following DataFrame (pandas version 0.13.1):

>>> import pandas as pd
>>> DF = pd.DataFrame({'Group':['G1','G1','G2','G2'],'Start':['10','10','12','13'],'End':['13','13','14','15'],'Sample':['S1','S2','S3','S3'],'Status':['yes','yes','no','yes'],'pValue':[0.13,0.12,0.96,0.76],'pValueString':['13/100','12/100','96/100','76/100'],'desc':['aaaaaa','bbbbbb','aaaaaa','cccccc']})
>>> DF
  End Group Sample Start Status  pValue pValueString desc
0  13    G1     S1    10    yes    0.13       13/100 aaaaaa
1  13    G1     S2    10    yes    0.12       12/100 bbbbbb
2  14    G2     S3    12     no    0.96       96/100 aaaaaa
3  15    G2     S3    13    yes    0.76       76/100 cccccc

[4 rows x 8 columns]

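Starting from the DataFrame above:

  1. I want to group by 'Group'.
  2. Then sub-group by each Start-End pair.
  3. Pivot the Sample values for each group.
  4. Aggregate by max(pValue).
  5. Get the Status and desc that correspond to the sample with the highest pValue, and replace the pivoted values with pValueString.
  6. In the end I need the result in the following format: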

    Group Start End Sample           Status  desc
                        S1   S2
    G1    10    13    13/100 12/100  yes     aaaaaa
                        S3
    G2    12    14    96/100         no      aaaaaa
          13    15    76/100         yes     cccccc
    

I have tried using pivot_table and groupby, but to no avail. Any help would be greatly appreciated.

What I have so far:

    grouped=DF.groupby('Group')
    for g,v in grouped:
        pd.pivot_table(data=v, values=['pValue','pValueString'], rows=['Group','Start','End'], cols=['Sample'])['pValueString']
    

How do I get the corresponding desc and Status?

2 answers:

Answer 0 (score: 2)

First, find the values of desc and Status:

groups = DF.groupby(['Group','Start','End'])
maxvals = groups.apply(lambda x: x.sort('pValue', ascending = False).head(1))
maxvals = maxvals[['Status','desc']].reset_index()    
maxvals
Out[69]: 
  Group Start End  level_3 Status    desc
0    G1    10  13        0    yes  aaaaaa
1    G2    12  14        2     no  aaaaaa
2    G2    13  15        3    yes  cccccc
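
`DataFrame.sort` was removed in later pandas releases. A minimal sketch of the same step for a current pandas version, assuming `sort_values` plus `drop_duplicates` as a stand-in for the `groupby(...).apply(...)` call above (a substitution, not the answerer's code; the `keys` name is introduced here for illustration):

```python
import pandas as pd

DF = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G2'],
    'Start': ['10', '10', '12', '13'],
    'End': ['13', '13', '14', '15'],
    'Sample': ['S1', 'S2', 'S3', 'S3'],
    'Status': ['yes', 'yes', 'no', 'yes'],
    'pValue': [0.13, 0.12, 0.96, 0.76],
    'pValueString': ['13/100', '12/100', '96/100', '76/100'],
    'desc': ['aaaaaa', 'bbbbbb', 'aaaaaa', 'cccccc'],
})

keys = ['Group', 'Start', 'End']  # illustrative name for the grouping columns

# Put the highest pValue first, keep the first row seen for each key,
# then restore key order and keep only the columns of interest.
maxvals = (DF.sort_values('pValue', ascending=False)
             .drop_duplicates(subset=keys)
             .sort_values(keys)
             .loc[:, keys + ['Status', 'desc']]
             .reset_index(drop=True))
print(maxvals)
```

Because no `groupby(...).apply(...)` is involved, this variant does not produce the `level_3` helper column.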

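Create the pivot table separately (the output shown below is the table before `reset_index()` is applied, so Group, Start and End still appear as the index):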

pvt = DF.pivot_table(index =['Group','Start','End'], 
                     columns = 'Sample', 
                     values = 'pValueString', 
                     aggfunc = max).reset_index()
pvt

Out[70]: 
Sample               S1      S2      S3
Group Start End                        
G1    10    13   13/100  12/100     NaN
G2    12    14      NaN     NaN  96/100
      13    15      NaN     NaN  76/100
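
One caveat with `aggfunc = max` on `pValueString`: the strings are compared lexicographically, which need not match the numeric order of `pValue` when a (Group, Start, End, Sample) combination has more than one row. That does not happen in the sample data, but a hedged sketch that selects by the numeric `pValue` first and pivots afterwards (the `keys` and `best` names are illustrative) could look like this:

```python
import pandas as pd

DF = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G2'],
    'Start': ['10', '10', '12', '13'],
    'End': ['13', '13', '14', '15'],
    'Sample': ['S1', 'S2', 'S3', 'S3'],
    'Status': ['yes', 'yes', 'no', 'yes'],
    'pValue': [0.13, 0.12, 0.96, 0.76],
    'pValueString': ['13/100', '12/100', '96/100', '76/100'],
    'desc': ['aaaaaa', 'bbbbbb', 'aaaaaa', 'cccccc'],
})

keys = ['Group', 'Start', 'End']

# String comparison is character by character, so '9/100' beats '13/100'.
print(max(['9/100', '13/100']))

# Keep, for each (Group, Start, End, Sample), the row with the largest numeric pValue.
best = (DF.sort_values('pValue', ascending=False)
          .drop_duplicates(subset=keys + ['Sample']))

# Only then spread the matching pValueString over one column per sample.
pvt = best.set_index(keys + ['Sample'])['pValueString'].unstack('Sample')
print(pvt)
```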

Finally, merge the two:

pd.merge(pvt, maxvals)
Out[73]: 
Sample Group Start End      S1      S2      S3  level_3 Status    desc
0         G1    10  13  13/100  12/100     NaN        0    yes  aaaaaa
1         G2    12  14     NaN     NaN  96/100        2     no  aaaaaa
2         G2    13  15     NaN     NaN  76/100        3    yes  cccccc
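
To get closer to the layout requested in the question, the merged frame can be tidied a little. The following is a sketch for a recent pandas version, not part of the original answer: it passes the join keys explicitly via `on=`, builds `maxvals` with `idxmax` so that no `level_3` helper column is produced, and blanks out the `NaN` cells.

```python
import pandas as pd

DF = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G2'],
    'Start': ['10', '10', '12', '13'],
    'End': ['13', '13', '14', '15'],
    'Sample': ['S1', 'S2', 'S3', 'S3'],
    'Status': ['yes', 'yes', 'no', 'yes'],
    'pValue': [0.13, 0.12, 0.96, 0.76],
    'pValueString': ['13/100', '12/100', '96/100', '76/100'],
    'desc': ['aaaaaa', 'bbbbbb', 'aaaaaa', 'cccccc'],
})

keys = ['Group', 'Start', 'End']

# One row per key: the row holding the highest pValue.
best_rows = DF.groupby(keys)['pValue'].idxmax().tolist()
maxvals = DF.loc[best_rows, keys + ['Status', 'desc']]

# One column per sample, with the keys turned back into ordinary columns.
pvt = DF.pivot_table(index=keys, columns='Sample',
                     values='pValueString', aggfunc='max').reset_index()

# Merge on the explicit keys and show missing samples as empty strings.
result = pd.merge(pvt, maxvals, on=keys).fillna('')
print(result)
```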

Answer 1 (score: 0)

Create an index of the rows with the highest pValue (across samples, for a given Start and End within each Group):

idx = DF.groupby(['Group', 'Start', 'End']).pValue.agg(lambda x: x.idxmax())

Use this index to get Status and desc:

a = DF.ix[idx][['Status', 'desc']]
>>> a
  Status    desc
0    yes  aaaaaa
2     no  aaaaaa
3    yes  cccccc
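
`.ix` was deprecated and later removed from pandas. Assuming a current release, the same selection can be sketched with `.loc`, since `idx` holds row labels:

```python
import pandas as pd

DF = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G2'],
    'Start': ['10', '10', '12', '13'],
    'End': ['13', '13', '14', '15'],
    'Sample': ['S1', 'S2', 'S3', 'S3'],
    'Status': ['yes', 'yes', 'no', 'yes'],
    'pValue': [0.13, 0.12, 0.96, 0.76],
    'pValueString': ['13/100', '12/100', '96/100', '76/100'],
    'desc': ['aaaaaa', 'bbbbbb', 'aaaaaa', 'cccccc'],
})

# Row label of the highest pValue within each (Group, Start, End).
idx = DF.groupby(['Group', 'Start', 'End'])['pValue'].idxmax()

# Label-based selection of those rows; .loc stands in for .ix here.
a = DF.loc[idx.tolist(), ['Status', 'desc']]
print(a)
```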

Then get the maximum pValue for each group/sample (in pivot-table form):

b = DF.groupby(['Group', 'Start', 'End', 'Sample']).pValue.max().unstack()
>>> b
Sample             S1    S2    S3
Group Start End                  
G1    10    13   0.13  0.12   NaN
G2    12    14    NaN   NaN  0.96
      13    15    NaN   NaN  0.76

Finally, set the index of the first DataFrame to the new index and join:

a.index = b.index
df_new = b.join(a)
>>> df_new
                   S1    S2    S3 Status    desc
Group Start End                                 
G1    10    13   0.13  0.12   NaN    yes  aaaaaa
G2    12    14    NaN   NaN  0.96     no  aaaaaa
      13    15    NaN   NaN  0.76    yes  cccccc
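
Assigning `a.index = b.index` relies on both frames listing the groups in the same order. A more defensive sketch, under the assumption that a recent pandas version is available, indexes both frames by the same keys so that `join` aligns on them, and pivots `pValueString` instead of `pValue` as the question requested (the `keys` and `best` names are illustrative):

```python
import pandas as pd

DF = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G2'],
    'Start': ['10', '10', '12', '13'],
    'End': ['13', '13', '14', '15'],
    'Sample': ['S1', 'S2', 'S3', 'S3'],
    'Status': ['yes', 'yes', 'no', 'yes'],
    'pValue': [0.13, 0.12, 0.96, 0.76],
    'pValueString': ['13/100', '12/100', '96/100', '76/100'],
    'desc': ['aaaaaa', 'bbbbbb', 'aaaaaa', 'cccccc'],
})

keys = ['Group', 'Start', 'End']

# Status/desc of the best row per (Group, Start, End), indexed by those keys.
top = DF.groupby(keys)['pValue'].idxmax().tolist()
a = DF.loc[top].set_index(keys)[['Status', 'desc']]

# Best row per (Group, Start, End, Sample), pivoted on its string form.
best = DF.loc[DF.groupby(keys + ['Sample'])['pValue'].idxmax().tolist()]
b = best.set_index(keys + ['Sample'])['pValueString'].unstack('Sample')

# Both frames carry the same key index, so join matches rows by key.
df_new = b.join(a)
print(df_new)
```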