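My goal is to group by the "Patient" column and output each patient on a single row, followed by several columns from my input file in sequence. Ideally, I would like a counter in the column headers. Here is a sample of my input file: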
Patient Test panel gene alteration
1 A 54 APC E1345*
1 B 54 TP53 Y205H
1 C 54 APC V2278V
2 A 54 KRAS G12D
2 B 54 PTEN L25L
3 A 54 KRAS G13D
3 C 54 TP53 C141W
3 C 54 APC R876*
3 A 54 ERBB2 L663P
The desired output groups by the "Patient" column and then loops through "Test", "gene", and "alteration" to produce the following:
Patient Test gene alteration Test gene alteration Test gene alteration Test gene alteration
1 A APC E1345* B TP53 Y205H C APC V2278V
2 A KRAS G12D B PTEN L25L
3 A KRAS G13D C TP53 C141W C APC R876* A ERBB2 L663P
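Ideally, the Test / gene / alteration headers would be numbered, i.e. Test_1 gene_1 alteration_1 and so on. However, I realize this complicates things.
Here is what I have tried; I could not get the output I am after: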
import pandas as pd
df = pd.read_table(args.md, sep="\t")  # args.md is the path to my tab-separated input file
grouped = df.groupby('Patient')
values = grouped['gene'].apply('\t'.join).reset_index()
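The output of this approach 1) does not let me combine more than the 'gene' column, so using ['gene', 'Test'] does not give the desired output, and 2) shows the '\t' separator from the join literally as '\t' rather than as a tab character.
So I tried: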
grouped = df.groupby('Patient')
print(grouped)
values = grouped['gene'].apply('\t'.join).reset_index()
print(values)
id_df = grouped['Test'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'Test{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)
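The structure from the second attempt does not meet my needs, but it does give me a counter.
I was wondering whether anyone could offer some insight into getting the desired output. I pieced the commands above together from looking at SO, but I have not been able to troubleshoot them.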
Answer 0 (score: 1)
One possible solution. Maybe not super elegant, but it works.
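grouped = df.groupby('Patient')
col = ['Patient']  # output column order, built up as new names appear
data = []          # one dict per patient
for p, g in grouped:
    d = {'Patient': p}
    # Renumber rows 0..n-1 within the patient; the old index becomes column 0.
    g = g.reset_index()
    for i, row in g.iterrows():
        # Start at 2 to skip the 'index' and 'Patient' columns.
        for c in range(2, len(g.columns)):
            col_name = g.columns[c] + '_' + str(i + 1)
            d[col_name] = row[g.columns[c]]
            if col_name not in col:
                col.append(col_name)
    data.append(d)
df = pd.DataFrame(data, columns=col)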
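For the numbered headers the question asks for (Test_1, gene_1, alteration_1, ...), the same reshaping can probably be done without an explicit loop. The sketch below is one possible variant, not the code from this answer: it assumes the sample data from the question, keeps only Test, gene, and alteration (the loop above also carries panel), and the helper column `n` and the names `value_cols`, `numbered`, and `wide` are made up for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question (whitespace-separated here for brevity).
raw = """Patient Test panel gene alteration
1 A 54 APC E1345*
1 B 54 TP53 Y205H
1 C 54 APC V2278V
2 A 54 KRAS G12D
2 B 54 PTEN L25L
3 A 54 KRAS G13D
3 C 54 TP53 C141W
3 C 54 APC R876*
3 A 54 ERBB2 L663P
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

value_cols = ["Test", "gene", "alteration"]  # hypothetical name: fields repeated per patient

# Number each patient's rows 1, 2, 3, ... in file order.
numbered = df.assign(n=df.groupby("Patient").cumcount() + 1)

# One row per patient; columns become (field, n) pairs.
wide = numbered.set_index(["Patient", "n"])[value_cols].unstack("n")

# Flatten the pairs into names such as Test_1, gene_1, alteration_1.
wide.columns = [f"{name}_{i}" for name, i in wide.columns]

# Reorder so the three fields of each numbered test sit next to each other.
n_max = int(numbered["n"].max())
ordered = [f"{name}_{i}" for i in range(1, n_max + 1) for name in value_cols]
wide = wide[ordered].reset_index()

print(wide.to_string(index=False))
```

Patients with fewer rows than the maximum get NaN in the unused columns.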
Answer 1 (score: 1)
An approach using melt, groupby, and unstack:
In []: df
Out[]:
Patient Test panel gene alteration
0 1 A 54 APC E1345*
1 1 B 54 TP53 Y205H
2 1 C 54 APC V2278V
3 2 A 54 KRAS G12D
4 2 B 54 PTEN L25L
5 3 A 54 KRAS G13D
6 3 C 54 TP53 C141W
7 3 C 54 APC R876*
8 3 A 54 ERBB2 L663P
pd.DataFrame.melt can tidy this table:
In []: tidy = df.melt(id_vars=['Patient', 'Test'], value_vars=['panel', 'gene', 'alteration'])
In []: tidy
Out[]:
Patient Test variable value
0 1 A panel 54
1 1 B panel 54
2 1 C panel 54
3 2 A panel 54
4 2 B panel 54
5 3 A panel 54
6 3 C panel 54
7 3 C panel 54
8 3 A panel 54
9 1 A gene APC
10 1 B gene TP53
11 1 C gene APC
12 2 A gene KRAS
13 2 B gene PTEN
14 3 A gene KRAS
15 3 C gene TP53
16 3 C gene APC
17 3 A gene ERBB2
18 1 A alteration E1345*
19 1 B alteration Y205H
20 1 C alteration V2278V
21 2 A alteration G12D
22 2 B alteration L25L
23 3 A alteration G13D
24 3 C alteration C141W
25 3 C alteration R876*
26 3 A alteration L663P
In []: (tidy.groupby(['Patient', 'Test', 'variable']) # group by three levels of interest
...: .first() # access values as a dataframe
...: .unstack(level=[1,2])) # pivot on levels [1, 2] of multiindex
Out[]:
value
Test A B C
variable alteration gene panel alteration gene panel alteration gene panel
Patient
1 E1345* APC 54 Y205H TP53 54 V2278V APC 54
2 G12D KRAS 54 L25L PTEN 54 NaN NaN NaN
3 G13D KRAS 54 NaN NaN NaN C141W TP53 54
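One caveat with the output above: patient 3 has two A tests and two C tests in the input, and `.first()` keeps only the first row of each (Patient, Test) pair, so APC R876* and ERBB2 L663P do not appear. A minimal sketch for spotting such rows before pivoting, assuming the sample data from the question (the names `dupes` and `dropped` are illustrative only):

```python
import pandas as pd

# Sample data from the question, without the constant 'panel' column.
df = pd.DataFrame({
    "Patient": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Test": list("ABCABACCA"),
    "gene": ["APC", "TP53", "APC", "KRAS", "PTEN", "KRAS", "TP53", "APC", "ERBB2"],
    "alteration": ["E1345*", "Y205H", "V2278V", "G12D", "L25L",
                   "G13D", "C141W", "R876*", "L663P"],
})

# Every row whose (Patient, Test) pair occurs more than once.
dupes = df[df.duplicated(["Patient", "Test"], keep=False)]

# The rows an aggregation with 'first' would silently discard.
dropped = df[df.duplicated(["Patient", "Test"], keep="first")]

print(dupes)
print(dropped)
```

If `dropped` is not empty, pivoting on Test alone loses data, and a per-patient counter is needed to keep every row.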
This gives an equivalent result:
In []: pd.crosstab(tidy.Patient, # index
[tidy.Test, tidy.variable], # columns
values=tidy.value,
aggfunc='first') # get first value
Out[]:
Test A B C
variable alteration gene panel alteration gene panel alteration gene panel
Patient
1 E1345* APC 54 Y205H TP53 54 V2278V APC 54
2 G12D KRAS 54 L25L PTEN 54 NaN NaN NaN
3 G13D KRAS 54 NaN NaN NaN C141W TP53 54
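To write a table like this out as a flat, tab-separated file, the two header levels can be joined into single names first. The sketch below uses a small hand-built stand-in for the table above (the names `wide`, `flat`, and `tsv` are illustrative). It also touches on the literal '\t' mentioned in the question: that is most likely just how the DataFrame display escapes the character, and writing with `to_csv(sep='\t')` produces real tabs.

```python
import io
import pandas as pd

# Small stand-in for the pivoted table, with (Test, variable) column pairs.
wide = pd.DataFrame(
    [["E1345*", "APC", "Y205H", "TP53"],
     ["G12D", "KRAS", "L25L", "PTEN"]],
    index=pd.Index([1, 2], name="Patient"),
    columns=pd.MultiIndex.from_product(
        [["A", "B"], ["alteration", "gene"]], names=["Test", "variable"]
    ),
)

# Join each (Test, variable) pair into one header such as A_gene.
flat = wide.copy()
flat.columns = ["_".join(pair) for pair in flat.columns]
flat = flat.reset_index()

# Write tab-separated text; a file path can be passed instead of the buffer.
buf = io.StringIO()
flat.to_csv(buf, sep="\t", index=False)
tsv = buf.getvalue()
print(tsv)
```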