Question

I have a multi-part problem. There are two tables:

Table 1, sample information:

Sample Compound Label Abundance
 1     ABC      0      10
 1     ABC      1      50
 2     ABC      0      100
 2     ABC      0      5
 3     ABC      0      100
 4     ABC      0      5
 1     DEF      0      10
 1     DEF      1      50
 1     DEF      2      100
 2     DEF      0      5
 3     DEF      0      100
 3     DEF      1      5

Table 2, cohort information:

Sample Cohort 
 1     control  
 2     control     
 3     disease     
 4     disease

I have three tasks: a) Sum the total abundance for each sample in Table 1, to produce a result like this:

Sample Compound Sum_Abundance
 1     ABC           60
 2     ABC           105
 3     ABC           100
 4     ABC           5

b) Merge those sums with Table 2 to add a column with the cohort information:

Sample Compound Sum_Abundance Cohort Info
 1     ABC           60       control
 2     ABC           105      control
 3     ABC           100      disease
 4     ABC           5        disease

c) Average the summed abundance of each compound within each cohort:

Compound Avg_Abundance Cohort Info
   ABC           82.5       control
   ABC           52.5       disease

I tried the following steps:

pivot_table=pd.pivot_table(table1, values=['Abundance'], index=['Sample', 'Name'], aggfunc = np.sum)
print(table1.head(2))
sum_table = pd.DataFrame(pivot_table)
cohort_df = pd.DataFrame(table2)
print(cohort_df.head())
merged_df = pd.merge(sum_table, cohort_df, on = "Sample")

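This is where merging the two frames drops the compound column, and no matter what I try I can't get past it. If I put 'Name' in the columns instead, it creates nice-looking output, but then I don't know how to average the fields.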

Answer 1

Here is what I would do:

step1 = (df1.groupby(['Sample', 'Compound'])
            ['Abundance'].sum()
            .reset_index(name='Sum_Abundance')
        )

step2 = step1.merge(df2, on='Sample')

step3 = (step2.groupby(['Compound', 'Cohort'])
              ['Sum_Abundance'].mean()
              .reset_index(name='Avg_Abundance')
        )

Output:

  Compound   Cohort  Avg_Abundance
0      ABC  control           82.5
1      ABC  disease           52.5
2      DEF  control           82.5
3      DEF  disease          105.0
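
To try the three steps end to end, the two tables have to exist as DataFrames first. The following is only a sketch: it assumes the tables from the question are loaded as df1 and df2 with exactly the column names shown there, and it types the data in by hand rather than reading it from a file.

```python
import pandas as pd

# Table 1 from the question: one row per (Sample, Compound, Label)
df1 = pd.DataFrame({
    'Sample':    [1, 1, 2, 2, 3, 4, 1, 1, 1, 2, 3, 3],
    'Compound':  ['ABC'] * 6 + ['DEF'] * 6,
    'Label':     [0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1],
    'Abundance': [10, 50, 100, 5, 100, 5, 10, 50, 100, 5, 100, 5],
})

# Table 2 from the question: one row per Sample
df2 = pd.DataFrame({
    'Sample': [1, 2, 3, 4],
    'Cohort': ['control', 'control', 'disease', 'disease'],
})

# a) total abundance per sample and compound
step1 = (df1.groupby(['Sample', 'Compound'])['Abundance']
            .sum()
            .reset_index(name='Sum_Abundance'))

# b) attach the cohort of each sample
step2 = step1.merge(df2, on='Sample')

# c) mean of the per-sample sums within each compound and cohort
step3 = (step2.groupby(['Compound', 'Cohort'])['Sum_Abundance']
              .mean()
              .reset_index(name='Avg_Abundance'))

print(step3)
```

Sample 4 has no DEF rows, so the DEF average for the disease cohort is taken over sample 3 alone.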

If you don't need the intermediate dataframes (step1, step2), you can chain the steps together:

final_df = (df1.groupby(['Sample','Compound'])
                ['Abundance'].sum()
               .reset_index(name='Sum_Abundance')
               .merge(df2, on='Sample')
               .groupby(['Compound','Cohort'])
                ['Sum_Abundance'].mean()
               .reset_index(name='Avg_Abundance')
           )
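
As for why the original attempt lost the compound column: pivot_table with two index keys returns a frame whose rows carry a MultiIndex, and as far as I know, merging on only one of those levels discards the others. One way around it, sketched below, is to call reset_index() before the merge. The sketch assumes the column is named Compound as in the tables (the question's code calls it Name) and uses only the ABC rows to stay short.

```python
import pandas as pd

# ABC rows of Table 1 from the question (column names are assumed)
table1 = pd.DataFrame({
    'Sample':    [1, 1, 2, 2, 3, 4],
    'Compound':  ['ABC'] * 6,
    'Abundance': [10, 50, 100, 5, 100, 5],
})
table2 = pd.DataFrame({
    'Sample': [1, 2, 3, 4],
    'Cohort': ['control', 'control', 'disease', 'disease'],
})

# pivot_table puts Sample and Compound into a row MultiIndex ...
sum_table = pd.pivot_table(table1, values='Abundance',
                           index=['Sample', 'Compound'], aggfunc='sum')

# ... so move both levels back into ordinary columns before merging
merged_df = sum_table.reset_index().merge(table2, on='Sample')

# Compound is now a regular column and can be used as a grouping key
avg_df = (merged_df.groupby(['Compound', 'Cohort'])['Abundance']
                   .mean()
                   .reset_index(name='Avg_Abundance'))
print(avg_df)
```

The groupby version above avoids the issue in the same way, since reset_index(name=...) already returns Sample and Compound as plain columns.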
