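I need to map two completely different dataframes onto each other (thanks, biology). All the pandas tutorials are based on simpler transformations, and as a real newbie I could not solve this without four nested loops, which did not work anyway. I am really curious about a pythonic way to do this without having to go back to Excel.
The first one looks like df1: observations of zeros and ones for thousands of genes across categories a-j.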
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
print(df1)
a b c d e f g h i j
gene1 1 0 1 0 1 0 1 1 1 0
gene2 0 1 0 0 0 0 0 0 1 0
gene3 0 1 1 1 1 1 0 0 0 0
gene4 1 0 1 0 0 1 0 1 1 1
gene5 0 0 1 0 0 0 0 0 0 0
gene6 0 1 0 0 1 0 1 0 1 0
gene7 1 1 0 1 1 0 0 0 1 0
gene8 0 0 0 1 1 1 1 0 1 0
gene9 1 0 1 0 1 0 1 1 0 1
gene10 1 0 0 0 1 0 1 0 1 1
The second one looks like df2: a mapping from the higher-level categories (X, Y, Z, W) to the lower-level categories. This one has NaNs and no index.
df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
'Y': ['d', 'b', 'c','f'],
'Z':['g', 'h','e','NaN'],
'W': ['i', 'j','NaN','NaN']},index=None)
print(df2)
W X Y Z
0 i a d g
1 j NaN b h
2 NaN NaN c e
3 NaN NaN f NaN
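What I need is something like result1. There is another tricky thing here. For example, gene4 is in both categories i and j, and both belong to W, but I still want only a single '1' in result1.loc['gene4', 'W']. The final result still needs to be binary.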
result1 = pd.DataFrame({'X': ['1','0','0','1','0','0','1','0','1','1'],
'Y': ['1','1','1','1','1','1','1','1','1','0'],
'Z': ['1','0','1','1','0','1','1','1','1','1'],
'W': ['1','1','0','1','0','1','1','1','1','1']}, index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
print(result1)
W X Y Z
gene1 1 1 1 1
gene2 1 0 1 0
gene3 0 0 1 1
gene4 1 1 1 1
gene5 0 0 1 0
gene6 1 0 1 1
gene7 1 1 1 1
gene8 1 0 1 1
gene9 1 1 1 1
gene10 1 1 0 1
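This could be another possible format for the result. [Updated to match the actual expected result.] If someone is willing to show both (or an easy way to convert between them), I would appreciate it even more, and science would be grateful too.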
result1 = pd.DataFrame({'1': ['gene1','gene1','gene1','gene1'],
'2': ['gene2','gene4','gene2','gene3'],
'3': ['gene4','gene7','gene3','gene4'],
'4': ['gene6','gene9','gene4','gene6'],
'5': ['gene7','gene10','gene5','gene7'],
'6': ['gene8','NaN','gene6','gene8'],
'7': ['gene9','NaN','gene7','gene9'],
'8': ['gene10','NaN','gene8','gene10'],
'9': ['NaN','NaN','gene9','NaN'],
},
index = ['W','X','Y','Z'])
print(result1)
1 2 3 4 5 6 7 8 9
W gene1 gene2 gene4 gene6 gene7 gene8 gene9 gene10 NaN
X gene1 gene4 gene7 gene9 gene10 NaN NaN NaN NaN
Y gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
Z gene1 gene3 gene4 gene6 gene7 gene8 gene9 gene10 NaN
Thanks a lot for patiently reading this long question.
Answer 0: (score: 1)
Here we go! Let's give this a try.
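import pandas as pd
import numpy as np
# df1 is random, so the numbers in the output will differ from run to run.
df1 = pd.DataFrame(np.random.randint(0,2,size=(10,10)),columns=list('abcdefghij'), index=['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
'Y': ['d', 'b', 'c','f'],
'Z':['g', 'h','e','NaN'],
'W': ['i', 'j','NaN','NaN']},index=None)
# Turn the 'NaN' placeholder strings into real missing values.
df2 = df2.replace('NaN',np.nan)
# gmap: Series mapping each lower-level category (a-j) to its higher-level one (W-Z).
gmap = df2.stack().reset_index().drop('level_0',axis=1).set_index(0)['level_1']
# Keep only the (gene, category) pairs equal to 1, map them to W-Z, drop repeats.
df3 = df1.stack().replace(0,np.nan).dropna().reset_index(level=1)['level_1'].map(gmap).reset_index().drop_duplicates()
# Count per (gene, group) and pivot the groups into columns.
df_out = df3.groupby(['index','level_1'])['level_1'].count().unstack()
print(df_out)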
Output:
level_1 W X Y Z
index
gene1 1.0 NaN NaN NaN
gene10 1.0 1.0 1.0 1.0
gene2 1.0 1.0 1.0 1.0
gene3 1.0 1.0 1.0 1.0
gene4 1.0 NaN 1.0 1.0
gene5 1.0 NaN 1.0 NaN
gene6 1.0 1.0 1.0 1.0
gene7 NaN 1.0 1.0 1.0
gene8 NaN NaN 1.0 1.0
gene9 1.0 NaN NaN 1.0
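The output above holds floats and leaves NaN where the question asks for 0, and it sorts gene10 right after gene1. As a hedged alternative (a minimal sketch, not the answerer's method), the same mapping can be done one higher-level category at a time: select its member columns from df1 and flag a gene if any of them is 1. The df1 values are hard-coded from the printout in the question so the result can be checked against result1; the name `binary` is only illustrative.

```python
import numpy as np
import pandas as pd

genes = [f"gene{i}" for i in range(1, 11)]
# df1 exactly as printed in the question (hard-coded instead of random).
df1 = pd.DataFrame(
    [
        [1, 0, 1, 0, 1, 0, 1, 1, 1, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 1, 0, 1, 1, 1],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
        [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 1, 1, 1, 0, 1, 0],
        [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 0, 1, 0, 1, 0, 1, 1],
    ],
    index=genes,
    columns=list("abcdefghij"),
)
df2 = pd.DataFrame({
    "X": ["a", "NaN", "NaN", "NaN"],
    "Y": ["d", "b", "c", "f"],
    "Z": ["g", "h", "e", "NaN"],
    "W": ["i", "j", "NaN", "NaN"],
}).replace("NaN", np.nan)

# For each higher-level category, pick its lower-level member columns and
# set 1 if any of them is 1, so two hits in the same group still give one 1.
binary = pd.DataFrame({
    group: df1[members.dropna().tolist()].any(axis=1).astype(int)
    for group, members in df2.items()
})
binary = binary[sorted(binary.columns)]  # column order W, X, Y, Z
print(binary)
```

Because the frame is built from df1's own index, the genes keep their original order and every cell is an integer 0 or 1. With the df1 shown in the question this reproduces result1, including the single 1 at binary.loc['gene4', 'W'].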
df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
'Y': ['d', 'b', 'c','f'],
'Z':['g', 'h','e','NaN'],
'W': ['i', 'j','NaN','NaN']},index=None)
df2 = df2.replace('NaN',np.nan)
gmap = df2.stack().reset_index().drop('level_0',axis=1).set_index(0)['level_1']
df3 = df1.stack().replace(0,np.nan).dropna().reset_index(level=1)['level_1'].map(gmap).reset_index().drop_duplicates()
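# Use the gene number as the column position, e.g. 'gene7' -> 7.
df3['cols'] = df3['index'].str.split('gene').str[1].astype(int)
# One row per higher-level category, one column per gene number.
df_out2 = df3.set_index(['level_1','cols'])['index'].unstack()
print(df_out2)
Output: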
cols 1 2 3 4 5 6 7 8 9 10
level_1
W gene1 gene2 gene3 gene4 gene5 None gene7 gene8 gene9 gene10
X None None gene3 None gene5 None None gene8 gene9 gene10
Y gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
Z None gene2 None gene4 None gene6 None gene8 gene9 None
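The table above keeps each gene in the column of its own number, whereas the second format in the question packs the genes to the left and pads with NaN at the end. A possible way to get the packed layout, and to convert it back into the binary matrix, is sketched below. It starts from the expected result1 in the question, typed in as integers; the names `binary`, `lists`, `long` and `back` are only illustrative.

```python
import pandas as pd

genes = [f"gene{i}" for i in range(1, 11)]
# Binary gene-by-category matrix copied from the question's expected result1.
binary = pd.DataFrame(
    {
        "W": [1, 1, 0, 1, 0, 1, 1, 1, 1, 1],
        "X": [1, 0, 0, 1, 0, 0, 1, 0, 1, 1],
        "Y": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        "Z": [1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
    },
    index=genes,
)

# Matrix -> lists: one row per category, its genes packed to the left.
# Series of different lengths are aligned on 0..n-1 and padded with NaN.
lists = pd.DataFrame({
    group: pd.Series(flags[flags == 1].index)
    for group, flags in binary.items()
}).T
lists.columns = range(1, lists.shape[1] + 1)
print(lists)

# Lists -> matrix: go to long form, drop the padding, count memberships.
long = lists.stack().dropna().reset_index()
long.columns = ["group", "position", "gene"]
back = pd.crosstab(long["gene"], long["group"]).reindex(genes, fill_value=0)
print(back)
```

The round trip relies on each gene appearing at most once per category, which holds here because the matrix is binary; reindex restores the original gene order and gives an all-zero row to any gene that belongs to no category.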