I have a multi-part question. I have two tables:
Table 1, sample information:
Sample Compound Label Abundance
1 ABC 0 10
1 ABC 1 50
2 ABC 0 100
2 ABC 0 5
3 ABC 0 100
4 ABC 0 5
1 DEF 0 10
1 DEF 1 50
1 DEF 2 100
2 DEF 0 5
3 DEF 0 100
3 DEF 1 5
Table 2, cohort information:
Sample Cohort
1 control
2 control
3 disease
4 disease
I have three tasks: a) Sum the abundance for each sample and compound in table 1, to produce something like this:
Sample Compound Sum_Abundance
1 ABC 60
2 ABC 105
3 ABC 100
4 ABC 5
b) Merge those sums with table 2 to add a column with the cohort information:
Sample Compound Sum_Abundance Cohort Info
1 ABC 60 control
2 ABC 105 control
3 ABC 100 disease
4 ABC 5 disease
c) Average the summed abundance of each compound within each cohort:
Compound Avg_Abundance Cohort Info
ABC 82.5 control
ABC 52.5 disease
I tried the following steps:
pivot_table=pd.pivot_table(table1, values=['Abundance'], index=['Sample', 'Name'], aggfunc = np.sum)
print(table1.head(2))
sum_table = pd.DataFrame(pivot_table)
cohort_df = pd.DataFrame(table2)
print(cohort_df.head())
merged_df = pd.merge(sum_table, cohort_df, on = "Sample")
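This is where the merge combines the two frames but drops the Compound column, and no matter what I try I cannot get past it. If I put "Name" in columns instead, it produces a nice-looking output, but then I don't know how to average the fields.
Answer 0 (score: 1)
Here is what I would do: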
step1 = (df1.groupby(['Sample','Compound'])
            ['Abundance'].sum()
            .reset_index(name='Sum_Abundance')
        )
step2 = step1.merge(df2, on='Sample')
step3 = (step2.groupby(['Compound','Cohort'])
            ['Sum_Abundance'].mean()
            .reset_index(name='Avg_Abundance')
        )
Output:
Compound Cohort Avg_Abundance
0 ABC control 82.5
1 ABC disease 52.5
2 DEF control 82.5
3 DEF disease 105.0
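To reproduce this from scratch, here is a minimal self-contained sketch. It assumes the two tables from the question are loaded as `df1` and `df2` with exactly the column names shown there; the inline `DataFrame` construction is only a stand-in for however the data is actually read.

```python
import pandas as pd

# Table 1 from the question (sample information), typed in by hand.
df1 = pd.DataFrame({
    "Sample":    [1, 1, 2, 2, 3, 4, 1, 1, 1, 2, 3, 3],
    "Compound":  ["ABC"] * 6 + ["DEF"] * 6,
    "Label":     [0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1],
    "Abundance": [10, 50, 100, 5, 100, 5, 10, 50, 100, 5, 100, 5],
})

# Table 2 from the question (cohort information).
df2 = pd.DataFrame({
    "Sample": [1, 2, 3, 4],
    "Cohort": ["control", "control", "disease", "disease"],
})

# a) total abundance per sample and compound
step1 = (df1.groupby(["Sample", "Compound"])["Abundance"]
            .sum()
            .reset_index(name="Sum_Abundance"))

# b) attach the cohort of each sample
step2 = step1.merge(df2, on="Sample")

# c) mean of the per-sample totals within each compound and cohort
step3 = (step2.groupby(["Compound", "Cohort"])["Sum_Abundance"]
              .mean()
              .reset_index(name="Avg_Abundance"))

print(step3)
```

Because `reset_index` turns `Sample` and `Compound` back into ordinary columns, both are still present after the merge and can be used as grouping keys in the last step.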
If you don't need the intermediate DataFrames (step1, step2), you can chain everything together:
final_df = (df1.groupby(['Sample','Compound'])
               ['Abundance'].sum()
               .reset_index(name='Sum_Abundance')
               .merge(df2, on='Sample')
               .groupby(['Compound','Cohort'])
               ['Sum_Abundance'].mean()
               .reset_index(name='Avg_Abundance')
           )
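If you would rather keep the `pivot_table` call from the question, the sketch below shows one way that should work. It rests on an assumption about the cause: `pivot_table` leaves `Sample` and `Compound` in the row index, and the merge on `Sample` then loses the other index level, so calling `reset_index()` before merging keeps `Compound` as a regular column. The names `table1` and `table2` follow the question; the data is retyped here for illustration, using `Compound` rather than `Name` as the column name.

```python
import pandas as pd

# Same data as in the question, retyped so the example runs on its own.
table1 = pd.DataFrame({
    "Sample":    [1, 1, 2, 2, 3, 4, 1, 1, 1, 2, 3, 3],
    "Compound":  ["ABC"] * 6 + ["DEF"] * 6,
    "Label":     [0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1],
    "Abundance": [10, 50, 100, 5, 100, 5, 10, 50, 100, 5, 100, 5],
})
table2 = pd.DataFrame({
    "Sample": [1, 2, 3, 4],
    "Cohort": ["control", "control", "disease", "disease"],
})

# Sum per (Sample, Compound), then move both index levels back into columns.
sum_table = (pd.pivot_table(table1, values="Abundance",
                            index=["Sample", "Compound"], aggfunc="sum")
               .reset_index()
               .rename(columns={"Abundance": "Sum_Abundance"}))

# Compound is now an ordinary column, so it is still there after the merge.
merged_df = sum_table.merge(table2, on="Sample")

# Average the per-sample sums within each compound and cohort.
avg_df = (merged_df.groupby(["Compound", "Cohort"], as_index=False)["Sum_Abundance"]
                   .mean()
                   .rename(columns={"Sum_Abundance": "Avg_Abundance"}))

print(avg_df)
```

Both routes give the same numbers; the `groupby` version is shorter, while this one changes the original attempt as little as possible.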