Grouping and filtering data

Time: 2015-04-14 20:06:13

Tags: python pandas

DataFrame:

Protein   Peptide   Mean intensity  
A1        AAB       4,54             
A1        ABB       5,56             
A1        ABB       4,67                       
A1        AAB       5,67             
A1        ABC       5,67            
A2        ABB       4,64             
A2        AAB       4,54             
A2        ABB       5,56             
A2        ABC       4,67                        
A2        ABC       5,67            

I need to find the top 2 (most frequent) peptides for each protein and average their intensities, so the output would be:

Protein    Peptide   Mean intensity
A1         AAB       (4,54 + 5,67) / 2
           ABB       (5,56 + 4,67) / 2
A2         ABB       5,10
           ABC       5,17

The catch is that the result needs to remain a DataFrame.

1 answer:

Answer 0 (score: 2)

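First, we can perform a groupby/apply operation to obtain, for each protein, the Protein/Peptide pairs with the two largest peptide counts. Pairs that occur fewer than twice are then dropped: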

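# For each protein, count how often each peptide occurs and keep the top two.
counts = (df.groupby(['Protein'])['Peptide']
            .apply(lambda x: x.value_counts().nlargest(2)))
# Keep only the pairs that occur at least twice.
counts = counts[counts >= 2]
# Name the column explicitly so it cannot clash with df's Peptide column in a merge.
counts = counts.to_frame(name='counts')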
#                  counts
# Protein Peptide        
# A1      AAB           2
#         ABB           2
# A2      ABB           2
#         ABC           2
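To see what the lambda does inside each group, the following minimal sketch applies the same two calls to the peptides of a single protein. The values are typed in from the question's sample data, and the names a1_peptides, tally and top_two are illustrative, not part of the original answer.

```python
import pandas as pd

# Peptide column for protein A1, copied from the question's sample data.
a1_peptides = pd.Series(['AAB', 'ABB', 'ABB', 'AAB', 'ABC'], name='Peptide')

# value_counts() tallies how often each peptide occurs ...
tally = a1_peptides.value_counts()

# ... and nlargest(2) keeps the two highest tallies.
top_two = tally.nlargest(2)

print(top_two)
```

Here AAB and ABB each occur twice and ABC once, so ABC is the peptide that falls out for A1.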

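Now we can merge the original DataFrame, df, with counts by joining the Protein and Peptide columns of df to the index of counts. Using an inner join ensures that only the Protein/Peptide pairs present in both df and counts appear in result: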

result = pd.merge(df, counts, left_on=['Protein', 'Peptide'], right_index=True,
                  how='inner')

#   Protein Peptide  Mean intensity  counts
# 0      A1     AAB            4.54       2
# 3      A1     AAB            5.67       2
# 1      A1     ABB            5.56       2
# 2      A1     ABB            4.67       2
# 5      A2     ABB            4.64       2
# 7      A2     ABB            5.56       2
# 8      A2     ABC            4.67       2
# 9      A2     ABC            5.67       2
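One way to check which rows the inner join discards is to run the same merge as a left join with indicator=True and look at the rows marked left_only. This is only a sanity-check sketch: the sample data is typed in with decimal points, counts is rebuilt by hand so the block is self-contained, and the names checked and discarded are illustrative.

```python
import pandas as pd

# Sample data from the question (intensities written with decimal points).
df = pd.DataFrame({
    'Protein': ['A1'] * 5 + ['A2'] * 5,
    'Peptide': ['AAB', 'ABB', 'ABB', 'AAB', 'ABC',
                'ABB', 'AAB', 'ABB', 'ABC', 'ABC'],
    'Mean intensity': [4.54, 5.56, 4.67, 5.67, 5.67,
                       4.64, 4.54, 5.56, 4.67, 5.67],
})

# The top-two pairs per protein, written out by hand for this sketch.
counts = pd.DataFrame(
    {'counts': [2, 2, 2, 2]},
    index=pd.MultiIndex.from_tuples(
        [('A1', 'AAB'), ('A1', 'ABB'), ('A2', 'ABB'), ('A2', 'ABC')],
        names=['Protein', 'Peptide']),
)

# A left join keeps every row of df; the _merge column records whether a match was found.
checked = pd.merge(df, counts, left_on=['Protein', 'Peptide'],
                   right_index=True, how='left', indicator=True)

# Rows without a match in counts are the ones an inner join would drop.
discarded = checked[checked['_merge'] == 'left_only']
print(discarded[['Protein', 'Peptide', 'Mean intensity']])
```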

Now the desired groupby/mean operation is easy to perform:

result = result.groupby(['Protein', 'Peptide'])['Mean intensity'].mean()
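Note that groupby/mean on a single column returns a Series indexed by Protein and Peptide, while the question asks for a DataFrame. As a small sketch on a hand-typed subset of the merged rows (the name filtered is illustrative), either reset_index() or as_index=False turns the group keys back into ordinary columns:

```python
import pandas as pd

# A hand-typed subset of the merged rows (protein A1 only).
filtered = pd.DataFrame({
    'Protein': ['A1', 'A1', 'A1', 'A1'],
    'Peptide': ['AAB', 'AAB', 'ABB', 'ABB'],
    'Mean intensity': [4.54, 5.67, 5.56, 4.67],
})

# Default behaviour: a Series whose index holds the group keys.
as_series = filtered.groupby(['Protein', 'Peptide'])['Mean intensity'].mean()

# Option 1: move the index levels back into columns.
as_frame = as_series.reset_index()

# Option 2: ask groupby not to use the keys as the index in the first place.
as_frame_alt = (filtered.groupby(['Protein', 'Peptide'], as_index=False)
                        ['Mean intensity'].mean())

print(as_frame)
```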

Putting it all together:

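import pandas as pd

# Columns are separated by two or more spaces, so "Mean intensity" stays one field.
# A regex separator needs the Python parsing engine.
# If the file uses decimal commas as in the question (4,54), also pass decimal=','.
df = pd.read_table('data', sep=r'\s{2,}', engine='python')

counts = (df.groupby(['Protein'])['Peptide']
            .apply(lambda x: x.value_counts().nlargest(2)))
counts = counts[counts >= 2]
counts = counts.to_frame(name='counts')
result = pd.merge(df, counts, left_on=['Protein', 'Peptide'], right_index=True,
                  how='inner')
result = result.groupby(['Protein', 'Peptide'])['Mean intensity'].mean()
# reset_index turns the grouped Series back into a DataFrame.
result = result.reset_index()
print(result)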

yields

  Protein Peptide  Mean intensity
0      A1     AAB           5.105
1      A1     ABB           5.115
2      A2     ABB           5.100
3      A2     ABC           5.170
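For comparison, here is a self-contained sketch that starts from the question's decimal-comma values and reaches the same table without a merge. It swaps in a different technique from the answer above: it aggregates count and mean per Protein/Peptide pair in one pass, then keeps the most frequent pairs with sort_values and head. The helper name top_peptide_means and its parameters n and min_count are hypothetical.

```python
import pandas as pd

# Sample data from the question, with intensities kept as decimal-comma strings.
raw = pd.DataFrame({
    'Protein': ['A1'] * 5 + ['A2'] * 5,
    'Peptide': ['AAB', 'ABB', 'ABB', 'AAB', 'ABC',
                'ABB', 'AAB', 'ABB', 'ABC', 'ABC'],
    'Mean intensity': ['4,54', '5,56', '4,67', '5,67', '5,67',
                       '4,64', '4,54', '5,56', '4,67', '5,67'],
})

# Convert "4,54" style strings to floats.
raw['Mean intensity'] = (raw['Mean intensity']
                         .str.replace(',', '.', regex=False)
                         .astype(float))


def top_peptide_means(frame, n=2, min_count=2):
    """Hypothetical helper: mean intensity of the n most frequent peptides per protein."""
    stats = (frame.groupby(['Protein', 'Peptide'])['Mean intensity']
                  .agg(['count', 'mean'])   # count ignores missing intensities
                  .reset_index())
    # Most frequent peptides first within each protein, then keep the first n.
    stats = stats.sort_values(['Protein', 'count'], ascending=[True, False])
    top = stats.groupby('Protein').head(n)
    # Drop pairs that occur fewer than min_count times.
    top = top[top['count'] >= min_count]
    return (top.drop(columns='count')
               .rename(columns={'mean': 'Mean intensity'})
               .sort_values(['Protein', 'Peptide'])
               .reset_index(drop=True))


summary = top_peptide_means(raw)
print(summary)
```

When several peptides tie for the last of the n places, which one is kept depends on sort order, so real data with ties may need an explicit tie-breaking rule.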