How to average the values of columns whose names share the same substring

Time: 2015-09-03 02:36:50

Tags: python pandas

I have the following data frame:

import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1]  })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]

It looks like this:

In [17]: df
Out[17]:
  probe gene  cellA.1  cellA.2  cellB.1  cellB.2
0     a  foo        5       12       15        5
1     b  bar        0       90        3        7
2     c  qux        1       13       11       11
3     d  woz        0        0        2        1

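Note that the values are held in columns whose names share the same substring (e.g. cellA and cellB). In the real data there can be more than these two cell IDs, and the numeric suffix can go higher as well (e.g. CellFoo.5).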

What I want is the average per cell ID, so that the result looks like this:

  probe gene  cellA  cellB
  a     foo     8.5   10
  b     bar    45      5
  c     qux     7     11
  d     woz     0      1.5

How can I achieve this with pandas?

3 answers:

Answer 0: (score: 3)

One approach is to write a function that takes a column name and returns the group you want that column to fall into:

>>> df = df.set_index(["probe", "gene"])
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean()
            cellA  cellB
probe gene              
a     foo     8.5   10.0
b     bar    45.0    5.0
c     qux     7.0   11.0
d     woz     0.0    1.5
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean().reset_index()
  probe gene  cellA  cellB
0     a  foo    8.5   10.0
1     b  bar   45.0    5.0
2     c  qux    7.0   11.0
3     d  woz    0.0    1.5

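Note that we set the index (and reset it afterwards) so that we don't have to special-case the columns we don't want to touch. Also note that we have to specify axis=1 because we want to group by columns, not by rows.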
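As far as I know, newer pandas releases (2.1 and later) deprecate the axis=1 argument of groupby. A minimal sketch of the same idea that avoids it is to transpose, group the rows, and transpose back. The names indexed and averaged are only illustrative, and the data is the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "probe": ["a", "b", "c", "d"],
    "gene": ["foo", "bar", "qux", "woz"],
    "cellA.1": [5, 0, 1, 0],
    "cellA.2": [12, 90, 13, 0],
    "cellB.1": [15, 3, 11, 2],
    "cellB.2": [5, 7, 11, 1],
})

# Move the identifier columns into the index so only numeric columns remain.
indexed = df.set_index(["probe", "gene"])

# After transposing, the old column names are row labels, so a plain
# row-wise groupby on the prefix before "." does the grouping.
averaged = (
    indexed.T
    .groupby(lambda name: name.split(".")[0])
    .mean()
    .T
    .reset_index()
)

print(averaged)
```

Transposing is cheap for a frame of this size; for very wide data it may be worth checking whether the round trip changes dtypes.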

Answer 1: (score: 2)

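You can use groupby(): split the columns into those whose names contain a "." and those that don't, average the first set by prefix, then concatenate the two parts back together.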

import pandas as pd

df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1]  })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]

mask = df.columns.str.contains(".", regex=False)
df1 = df.loc[:, ~mask]
df2 = df.loc[:, mask]
pd.concat([df1, df2.groupby(lambda name:name.split(".")[0], axis=1).mean()], axis=1)
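The regex=False argument in the mask matters here. A small sketch on a hypothetical subset of the column names shows why: as a regular expression, "." matches any character, so every column would be selected.

```python
import pandas as pd

# Illustrative subset of the column names from the question.
cols = pd.Index(["probe", "gene", "cellA.1", "cellA.2"])

# Literal match: only names that really contain a dot.
literal = cols.str.contains(".", regex=False).tolist()

# Regex match: "." matches any single character, so every name matches.
as_regex = cols.str.contains(".", regex=True).tolist()

print(literal)   # [False, False, True, True]
print(as_regex)  # [True, True, True, True]
```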

Answer 2: (score: 0)

You can use a list comprehension.

In [1]: df['cellA'] = [(x+y)/2. for x,y in zip(df['cellA.1'], df['cellA.2'])]
In [2]: df['cellB'] = [(x+y)/2. for x,y in zip(df['cellB.1'], df['cellB.2'])]
In [3]: df = df[['probe', 'gene', 'cellA', 'cellB']]
In [4]: df
Out [4]: 
     probe gene  cellA  cellB
     a     foo   8.5    10.0      
     b     bar   45.0   5.0       
     c     qux   7.0    11.0       
     d     woz   0.0    1.5
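The version above hard-codes two replicates per cell. Since the question mentions more cell IDs and higher suffixes (e.g. CellFoo.5), here is a rough sketch of how the same column-by-column idea could be generalized. The names id_cols, value_cols, prefixes and result are my own, and it assumes every non-identifier column is named prefix.number:

```python
import pandas as pd

df = pd.DataFrame({
    "probe": ["a", "b", "c", "d"],
    "gene": ["foo", "bar", "qux", "woz"],
    "cellA.1": [5, 0, 1, 0],
    "cellA.2": [12, 90, 13, 0],
    "cellB.1": [15, 3, 11, 2],
    "cellB.2": [5, 7, 11, 1],
})

id_cols = ["probe", "gene"]  # columns to carry through unchanged
value_cols = [c for c in df.columns if c not in id_cols]

# Collect the distinct prefixes in order of first appearance.
prefixes = []
for col in value_cols:
    prefix = col.split(".")[0]
    if prefix not in prefixes:
        prefixes.append(prefix)

# Average every group of columns that shares a prefix, row by row.
result = df[id_cols].copy()
for prefix in prefixes:
    members = [c for c in value_cols if c.split(".")[0] == prefix]
    result[prefix] = df[members].mean(axis=1)

print(result)
```

This works for any number of replicates per prefix, at the cost of a Python-level loop over the prefixes.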