Question

I have two CSV data sets that look like this:

gene,stem1,stem2,stem3,b1,b2,b3,t1
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3

And:

celltype,phenotype
SC,stem1
BC,b2
SC,stem2
SC,stem3
BC,b1
TC,t1
BC,b3

Loaded as data frames, they look like this:

In [5]: import pandas as pd
In [7]: main_df = pd.read_table("http://dpaste.com/2MRRRM3.txt", sep=",")

In [8]: main_df
Out[8]:
  gene  stem1  stem2  stem3  b1  b2  b3  t1
0  foo     20     10     11  23  22  79   3
1  bar     17     13    505  12  13  88   1
2  qui     17     13      5  12  13  88   3


In [11]: source_df = pd.read_table("http://dpaste.com/091PNE5.txt", sep=",")

In [12]: source_df
Out[12]:
  celltype phenotype
0       SC     stem1
1       BC        b2
2       SC     stem2
3       SC     stem3
4       BC        b1
5       TC        t1
6       BC        b3

What I want to do is average the columns of main_df according to the grouping given in source_df, so that the result looks like this:

       SC                BC                TC
foo   (20+10+11)/3     (23+22+79)/3        3/1
bar   (17+13+505)/3    (12+13+88)/3        1/1
qui   (17+13+5)/3      (12+13+88)/3        3/1

How can I do this?

Answer 1

You can convert source_df to a dict and pass it to .groupby() on main_df with axis=1:

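# Use the gene names as the row index so only the numeric columns remain
main_df.set_index('gene', inplace=True)
# Map each phenotype column name to its cell type, e.g. {'stem1': 'SC', 'b2': 'BC', ...}
col_dict = source_df.set_index('phenotype').squeeze().to_dict()
# Group the columns by their mapped cell type and take the mean within each group
main_df.groupby(col_dict, axis=1).mean()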

             BC          SC  TC
gene                           
foo   41.333333   13.666667   3
bar   37.666667  178.333333   1
qui   37.666667   11.666667   3
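
In pandas 2.1 and later, passing axis=1 to DataFrame.groupby emits a deprecation warning. Below is a minimal self-contained sketch of the same dict-based idea that transposes the frame instead. Inlining the sample data with io.StringIO (rather than reading the dpaste URLs) and the name `result` are assumptions made for illustration, not part of the original answer.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the example runs offline
main_csv = """gene,stem1,stem2,stem3,b1,b2,b3,t1
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
source_csv = """celltype,phenotype
SC,stem1
BC,b2
SC,stem2
SC,stem3
BC,b1
TC,t1
BC,b3
"""

main_df = pd.read_csv(io.StringIO(main_csv), index_col="gene")
source_df = pd.read_csv(io.StringIO(source_csv))

# Map each phenotype column name to its cell type
col_dict = source_df.set_index("phenotype")["celltype"].to_dict()

# Transpose so phenotypes become rows, group those rows by cell type,
# average within each group, then transpose back to genes-as-rows
result = main_df.T.groupby(col_dict).mean().T
print(result)
```

The transpose round trip should give the same table as the axis=1 call above, with genes as rows and one column per cell type.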

Answer 2

You can set the index on both source_df and main_df, then use pd.concat and groupby on celltype:

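# Index main_df by gene, so that main_df.T is indexed by phenotype name
main_df.set_index('gene', inplace=True)
# Index source_df by phenotype too, so pd.concat can align the two frames
source_df.set_index("phenotype", inplace=True)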

In [30]: pd.concat([main_df.T, source_df], axis=1)
Out[30]:
gene   foo  bar  qui celltype
b1      23   12   12       BC
b2      22   13   13       BC
b3      79   88   88       BC
stem1   20   17   17       SC
stem2   10   13   13       SC
stem3   11  505    5       SC
t1       3    1    3       TC


In [33]: pd.concat([main_df.T, source_df], axis=1).groupby(['celltype']).mean().T
Out[33]:
celltype         BC          SC  TC
gene
foo       41.333333   13.666667   3
bar       37.666667  178.333333   1
qui       37.666667   11.666667   3
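
For reference, here is a self-contained sketch of the concat approach that avoids inplace mutation. The inlined data via io.StringIO and the names `labelled` and `result` are assumptions for illustration. Because pd.concat with axis=1 aligns on the index, the row order of source_df does not need to match the column order of main_df.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the example runs offline
main_csv = """gene,stem1,stem2,stem3,b1,b2,b3,t1
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
source_csv = """celltype,phenotype
SC,stem1
BC,b2
SC,stem2
SC,stem3
BC,b1
TC,t1
BC,b3
"""

# Index both frames up front instead of calling set_index(..., inplace=True)
main_df = pd.read_csv(io.StringIO(main_csv), index_col="gene")
source_df = pd.read_csv(io.StringIO(source_csv), index_col="phenotype")

# Rows are phenotypes; the celltype column is attached by index alignment
labelled = pd.concat([main_df.T, source_df], axis=1)

# Average each gene within every cell type, then put genes back on the rows
result = labelled.groupby("celltype").mean().T
print(result)
```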
