DataFrame:
Protein  Peptide  Mean intensity
A1       AAB      4,54
A1       ABB      5,56
A1       ABB      4,67
A1       AAB      5,67
A1       ABC      5,67
A2       ABB      4,64
A2       AAB      4,54
A2       ABB      5,56
A2       ABC      4,67
A2       ABC      5,67
But I need to find the top 2 (most frequent) peptides for each protein and average their intensities, so the output would be:
Protein  Peptide  Mean intensity
A1       AAB      (4,54 + 5,67) / 2
         ABB      (5,56 + 4,67) / 2
A2       ABB      7,42
         ABC      5,17
The catch is that the result needs to remain a DataFrame.
Answer 0 (score: 2)
First, we can use a groupby/apply operation to find, for each protein, the Protein/Peptide pairs with the two largest peptide counts:
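counts = (df.groupby(['Protein'])['Peptide']
            .apply(lambda x: x.value_counts().nlargest(2)))
# keep only peptides that occur at least twice
counts = counts[counts >= 2]
# name the column explicitly so it cannot clash with df's 'Peptide' column
counts = counts.to_frame('counts')
# counts
#                  counts
# Protein Peptide
# A1      AAB           2
#         ABB           2
# A2      ABB           2
#         ABC           2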
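To see what the lambda does inside each group, here is a minimal sketch for a single protein. The Series is built by hand from the A1 rows of the question rather than taken from the grouped df, and the names peptides and top2 are only illustrative.

```python
import pandas as pd

# Peptides observed for one protein (the A1 rows from the question).
peptides = pd.Series(['AAB', 'ABB', 'ABB', 'AAB', 'ABC'], name='Peptide')

# value_counts() tallies each distinct peptide;
# nlargest(2) keeps the two entries with the highest tallies.
top2 = peptides.value_counts().nlargest(2)

print(top2.to_dict())
```

Here AAB and ABB each occur twice and ABC once, so ABC is dropped. If three or more peptides tie for a top spot, nlargest keeps the first ones it encounters by default; passing keep='all' would keep every tied peptide instead, which may or may not be what you want.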
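Now we can merge the original DataFrame df with counts by joining on df's columns and counts's index. Using an inner join ensures that only the Protein/Peptide pairs present in both df and counts appear in result: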
result = pd.merge(df, counts, left_on=['Protein', 'Peptide'], right_index=True,
                  how='inner')
#   Protein Peptide  Mean intensity  counts
# 0      A1     AAB            4.54       2
# 3      A1     AAB            5.67       2
# 1      A1     ABB            5.56       2
# 2      A1     ABB            4.67       2
# 5      A2     ABB            4.64       2
# 7      A2     ABB            5.56       2
# 8      A2     ABC            4.67       2
# 9      A2     ABC            5.67       2
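The filtering effect of the inner join can be sketched on a tiny example. The frames below are built by hand for illustration (three A1 rows, and a counts frame that deliberately omits ABC); they are not the full data from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Protein': ['A1', 'A1', 'A1'],
    'Peptide': ['AAB', 'ABB', 'ABC'],
    'Mean intensity': [4.54, 5.56, 5.67],
})

# counts is indexed by (Protein, Peptide); the pair (A1, ABC) is absent.
idx = pd.MultiIndex.from_tuples([('A1', 'AAB'), ('A1', 'ABB')])
counts = pd.DataFrame({'counts': [2, 2]}, index=idx)

# Match df's two key columns against the two levels of counts's index.
merged = pd.merge(df, counts, left_on=['Protein', 'Peptide'],
                  right_index=True, how='inner')

print(merged)
```

The ABC row has no matching entry in counts's index, so the inner join drops it, while the matching rows gain the counts column.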
Now the desired groupby/mean operation is easy to perform:
result = result.groupby(['Protein', 'Peptide'])['Mean intensity'].mean()
So, putting it all together,
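import pandas as pd

# columns are separated by two or more spaces; a regex separator needs the python engine
# (if the file uses decimal commas such as 4,54, also pass decimal=',')
df = pd.read_table('data', sep=r'\s{2,}', engine='python')

# per protein, the two most frequent peptides and how often they occur
counts = (df.groupby(['Protein'])['Peptide']
            .apply(lambda x: x.value_counts().nlargest(2)))
counts = counts[counts >= 2]
counts = counts.to_frame('counts')

# keep only the rows of df whose Protein/Peptide pair is in counts
result = pd.merge(df, counts, left_on=['Protein', 'Peptide'], right_index=True,
                  how='inner')
result = result.groupby(['Protein', 'Peptide'])['Mean intensity'].mean()
# reset_index turns the grouped Series back into a DataFrame
result = result.reset_index()
print(result)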
yields
  Protein Peptide  Mean intensity
0      A1     AAB           5.105
1      A1     ABB           5.115
2      A2     ABB           5.100
3      A2     ABC           5.170
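As a possible alternative, not part of the answer above, the count and the mean can be computed in a single groupby and the top two peptides per protein selected afterwards, which avoids the merge. This is only a sketch: the DataFrame is typed in by hand with decimal points, the names stats and top2 are illustrative, and unlike the answer it does not drop peptides that occur only once.

```python
import pandas as pd

df = pd.DataFrame({
    'Protein': ['A1'] * 5 + ['A2'] * 5,
    'Peptide': ['AAB', 'ABB', 'ABB', 'AAB', 'ABC',
                'ABB', 'AAB', 'ABB', 'ABC', 'ABC'],
    'Mean intensity': [4.54, 5.56, 4.67, 5.67, 5.67,
                       4.64, 4.54, 5.56, 4.67, 5.67],
})

# One row per Protein/Peptide pair, with its frequency and mean intensity.
stats = (df.groupby(['Protein', 'Peptide'])['Mean intensity']
           .agg(['count', 'mean'])
           .reset_index())

# Within each protein, put the most frequent peptides first and keep two.
top2 = (stats.sort_values(['Protein', 'count'], ascending=[True, False])
             .groupby('Protein')
             .head(2)
             .rename(columns={'mean': 'Mean intensity'})
             [['Protein', 'Peptide', 'Mean intensity']]
             .sort_values(['Protein', 'Peptide'])
             .reset_index(drop=True))

print(top2)
```

On this data the selection matches the merge-based result: AAB and ABB for A1, ABB and ABC for A2. The result is an ordinary DataFrame, as the question requires.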