I have the following data frame:
import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1] })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]
It looks like this:
In [17]: df
Out[17]:
  probe gene  cellA.1  cellA.2  cellB.1  cellB.2
0     a  foo        5       12       15        5
1     b  bar        0       90        3        7
2     c  qux        1       13       11       11
3     d  woz        0        0        2        1
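Note that the values sit in columns that share a common substring (e.g. cellA and cellB). In the real data there can be more than these two cell IDs, and the numeric suffix can go higher as well (e.g. CellFoo.5).
What I want is the mean per cell ID, so that the result looks like this: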
probe gene  cellA  cellB
    a  foo    8.5   10.0
    b  bar   45.0    5.0
    c  qux    7.0   11.0
    d  woz    0.0    1.5
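How can I achieve this with pandas?
Answer 0: (score: 3)
One way is to write a function that takes a column name and turns it into the group you want that column to fall into: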
>>> df = df.set_index(["probe", "gene"])
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean()
            cellA  cellB
probe gene
a     foo     8.5   10.0
b     bar    45.0    5.0
c     qux     7.0   11.0
d     woz     0.0    1.5
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean().reset_index()
  probe gene  cellA  cellB
0     a  foo    8.5   10.0
1     b  bar   45.0    5.0
2     c  qux    7.0   11.0
3     d  woz    0.0    1.5
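Note that we set the index (and reset it afterwards) so that we don't have to special-case the columns we don't want to touch. Also note that we have to specify axis=1 because we want to group by columns, not by rows.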
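In newer pandas releases (2.1 and later, to my knowledge) passing axis=1 to groupby triggers a deprecation warning. A minimal sketch of the same idea that avoids it, assuming the sample frame from the question: transpose so the column names become the row index, group the rows, then transpose back. The names indexed and result are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "probe": ["a", "b", "c", "d"],
    "gene": ["foo", "bar", "qux", "woz"],
    "cellA.1": [5, 0, 1, 0],
    "cellA.2": [12, 90, 13, 0],
    "cellB.1": [15, 3, 11, 2],
    "cellB.2": [5, 7, 11, 1],
})

# Move the identifier columns into the index so only numeric columns remain.
indexed = df.set_index(["probe", "gene"])

# After .T the old column names are row labels, so a plain row-wise groupby
# on the text before the first "." does the column grouping for us.
result = (
    indexed.T
    .groupby(lambda name: name.split(".")[0])
    .mean()
    .T
    .reset_index()
)
print(result)
```

The double transpose copies the data, which should be fine for frames of this size but may matter for very wide ones.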
Answer 1: (score: 2)
You can use groupby():
import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1] })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]
mask = df.columns.str.contains(".", regex=False)
df1 = df.loc[:, ~mask]
df2 = df.loc[:, mask]
pd.concat([df1, df2.groupby(lambda name:name.split(".")[0], axis=1).mean()], axis=1)
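The question mentions that real column names can look like CellFoo.5, with more cell IDs and higher suffixes. Here is a rough sketch of how the mask-and-concat idea could be wrapped up for that case. The helper name mean_by_prefix and the toy column names are made up for illustration, and it assumes that every column containing the separator is a replicate column. It splits on the last "." and groups on the transposed frame rather than passing axis=1.

```python
import pandas as pd


def mean_by_prefix(frame, sep="."):
    """Average columns that share the text before the last `sep`.

    Hypothetical helper for illustration, not part of pandas.
    """
    mask = frame.columns.str.contains(sep, regex=False)
    untouched = frame.loc[:, ~mask]   # e.g. identifier columns
    replicates = frame.loc[:, mask]   # e.g. "CellFoo.1", "CellFoo.5"
    # Map each replicate column to its group, "CellFoo.5" -> "CellFoo".
    groups = {name: name.rsplit(sep, 1)[0] for name in replicates.columns}
    # Group the transposed frame by that mapping, then transpose back.
    means = replicates.T.groupby(groups).mean().T
    return pd.concat([untouched, means], axis=1)


toy = pd.DataFrame({
    "probe": ["a", "b"],
    "CellFoo.1": [1, 4],
    "CellFoo.2": [2, 5],
    "CellFoo.5": [3, 6],
    "CellBar.1": [10, 20],
})
out = mean_by_prefix(toy)
print(out)
```

Since groupby sorts the group keys by default, the averaged columns come out in alphabetical order (CellBar before CellFoo here), not in their original order.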
Answer 2: (score: 0)
You can use a list comprehension.
In [1]: df['cellA'] = [(x+y)/2. for x,y in zip(df['cellA.1'], df['cellA.2'])]
In [2]: df['cellB'] = [(x+y)/2. for x,y in zip(df['cellB.1'], df['cellB.2'])]
In [3]: df = df[['probe', 'gene', 'cellA', 'cellB']]
In [4]: df
Out[4]:
  probe gene  cellA  cellB
0     a  foo    8.5   10.0
1     b  bar   45.0    5.0
2     c  qux    7.0   11.0
3     d  woz    0.0    1.5
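The version above hard-codes both cell IDs and assumes exactly two replicates each. A possible variation, sketched under the assumption that "probe" and "gene" are the only non-numeric columns, collects the prefixes from the column names and lets mean(axis=1) do the averaging. The names id_cols, prefixes and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "probe": ["a", "b", "c", "d"],
    "gene": ["foo", "bar", "qux", "woz"],
    "cellA.1": [5, 0, 1, 0],
    "cellA.2": [12, 90, 13, 0],
    "cellB.1": [15, 3, 11, 2],
    "cellB.2": [5, 7, 11, 1],
})

id_cols = ["probe", "gene"]  # assumed identifier columns

# Unique prefixes in order of first appearance, e.g. ["cellA", "cellB"].
prefixes = list(dict.fromkeys(
    col.split(".")[0] for col in df.columns if col not in id_cols
))

out = df[id_cols].copy()
for prefix in prefixes:
    # All replicate columns of this cell ID, however many there are.
    members = [col for col in df.columns if col.startswith(prefix + ".")]
    out[prefix] = df[members].mean(axis=1)

print(out)
```

This keeps the cell IDs in their original column order and works with any number of replicates per ID.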