Losing a column when merging pandas DataFrames

Time: 2020-04-01 19:29:38

Tags: python pandas

I have a multi-part problem. There are two tables:

Table 1, sample information:

Sample Compound Label Abundance
 1     ABC      0      10
 1     ABC      1      50
 2     ABC      0      100
 2     ABC      0      5
 3     ABC      0      100
 4     ABC      0      5
 1     DEF      0      10
 1     DEF      1      50
 1     DEF      2      100
 2     DEF      0      5
 3     DEF      0      100
 3     DEF      1      5

Table 2, cohort information:

Sample Cohort 
 1     control  
 2     control     
 3     disease     
 4     disease

I have three tasks: a) Sum the abundance for each sample and compound in Table 1 to produce something like this:

Sample Compound Sum_Abundance
 1     ABC           60
 2     ABC           105
 3     ABC           100
 4     ABC           5

b) Merge the sums with Table 2 to add a column with the cohort information:

Sample Compound Sum_Abundance Cohort Info
 1     ABC           60       control
 2     ABC           105      control
 3     ABC           100      disease
 4     ABC           5        disease

c) Average the summed abundance of each compound within each cohort:

Compound Avg_Abundance Cohort Info
   ABC           82.5       control
   ABC           52.5       disease

I tried the following steps:

pivot_table=pd.pivot_table(table1, values=['Abundance'], index=['Sample', 'Name'], aggfunc = np.sum)
print(table1.head(2))
sum_table = pd.DataFrame(pivot_table)
cohort_df = pd.DataFrame(table2)
print(cohort_df.head())
merged_df = pd.merge(sum_table, cohort_df, on = "Sample")

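This is where it merges the two frames but drops the compound column, and no matter what I try I cannot get past it. If I put 'Name' in the columns it creates a nice output, but then I don't know how to average the fields.

1 answer:

Answer 0 (score: 1)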

Here is what I would do:

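# a) total abundance per sample and compound
step1 = (df1.groupby(['Sample', 'Compound'])
            ['Abundance'].sum()
            .reset_index(name='Sum_Abundance')
        )

# b) attach the cohort of each sample
step2 = step1.merge(df2, on='Sample')

# c) mean of the per-sample sums for each compound and cohort
step3 = (step2.groupby(['Compound', 'Cohort'])
              ['Sum_Abundance'].mean()
              .reset_index(name='Avg_Abundance')
        )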

Output:

  Compound   Cohort  Avg_Abundance
0      ABC  control           82.5
1      ABC  disease           52.5
2      DEF  control           82.5
3      DEF  disease          105.0
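
To reproduce this end to end, the two tables from the question can be typed in by hand. This is only a sketch: it assumes `df1` and `df2` correspond to the question's Table 1 and Table 2, and that the columns are named exactly as in the printed tables.

```python
import pandas as pd

# Table 1 from the question, entered row by row (assumed to be df1)
df1 = pd.DataFrame({
    "Sample":    [1, 1, 2, 2, 3, 4, 1, 1, 1, 2, 3, 3],
    "Compound":  ["ABC"] * 6 + ["DEF"] * 6,
    "Label":     [0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1],
    "Abundance": [10, 50, 100, 5, 100, 5, 10, 50, 100, 5, 100, 5],
})

# Table 2 from the question (assumed to be df2)
df2 = pd.DataFrame({
    "Sample": [1, 2, 3, 4],
    "Cohort": ["control", "control", "disease", "disease"],
})

# a) sum per sample and compound, keeping both keys as ordinary columns
step1 = (df1.groupby(["Sample", "Compound"])["Abundance"]
            .sum()
            .reset_index(name="Sum_Abundance"))

# b) add the cohort of each sample
step2 = step1.merge(df2, on="Sample")

# c) average the per-sample sums within each compound and cohort
step3 = (step2.groupby(["Compound", "Cohort"])["Sum_Abundance"]
              .mean()
              .reset_index(name="Avg_Abundance"))

print(step3)
```

Note that sample 4 has no DEF rows, so the DEF/disease average is taken over sample 3 alone.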

If you don't need the intermediate DataFrames (step1, step2), you can chain the steps together:

final_df = (df1.groupby(['Sample', 'Compound'])
               ['Abundance'].sum()
               .reset_index(name='Sum_Abundance')
               .merge(df2, on='Sample')
               .groupby(['Compound', 'Cohort'])
               ['Sum_Abundance'].mean()
               .reset_index(name='Avg_Abundance')
           )
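
As for the original attempt: `pd.pivot_table` with two `index` keys returns a frame whose keys live in a MultiIndex rather than in columns. The pandas merging guide notes that when a merge uses only some levels of a MultiIndex, the remaining levels are dropped, which would explain the missing compound column. A minimal sketch of a workaround, assuming the column is called `Compound` as in the tables (the question's code uses `'Name'`) and using a small made-up subset of the data, is to call `reset_index()` before merging:

```python
import pandas as pd

# Small illustrative subset of the question's tables (values are for demonstration only)
table1 = pd.DataFrame({
    "Sample":    [1, 1, 2, 3],
    "Compound":  ["ABC", "ABC", "ABC", "ABC"],
    "Abundance": [10, 50, 105, 100],
})
table2 = pd.DataFrame({
    "Sample": [1, 2, 3],
    "Cohort": ["control", "control", "disease"],
})

# pivot_table leaves Sample and Compound in a MultiIndex;
# reset_index() moves both back into ordinary columns
sum_table = pd.pivot_table(
    table1,
    values="Abundance",
    index=["Sample", "Compound"],
    aggfunc="sum",
).reset_index()

# Compound is now a regular column, so it survives the merge on Sample
merged_df = sum_table.merge(table2, on="Sample")
print(merged_df)
```

From here, the same `groupby(['Compound', 'Cohort'])` step as above gives the per-cohort averages.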