I have the following data frame:
import pandas as pd
df = pd.DataFrame({'Probes':["1415693_at","1415693_at"],
'Genes':["Canx","LOC101056688 /// Wars "],
'cv_filter':[ 0.134,0.290],
'Organ' :["LN","LV"]} )
df = df[["Probes","Genes","cv_filter","Organ"]]
It looks like this:
In [16]: df
Out[16]:
       Probes                   Genes  cv_filter Organ
0  1415693_at                    Canx      0.134    LN
1  1415693_at  LOC101056688 /// Wars       0.290    LV
What I want to do is split each row on the ' /// ' separator in its Genes column, so that every gene gets its own row.
The result I want is:
       Probes         Genes  cv_filter Organ
0  1415693_at          Canx      0.134    LN
1  1415693_at  LOC101056688      0.290    LV
2  1415693_at          Wars      0.290    LV
In total I have about 150K rows to process. Is there a fast way to do this?
Answer 0 (score: 1)
You can first apply str.split to the Genes column, build a new Series from the pieces, and then join it back to the original df:
import pandas as pd
df = pd.DataFrame({'Probes':["1415693_at","1415693_at"],
'Genes':["Canx","LOC101056688 /// Wars "],
'cv_filter':[ 0.134,0.290],
'Organ' :["LN","LV"]} )
df = df[["Probes","Genes","cv_filter","Organ"]]
print(df)
       Probes                   Genes  cv_filter Organ
0  1415693_at                    Canx      0.134    LN
1  1415693_at  LOC101056688 /// Wars       0.290    LV
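# split each Genes entry on '///' and stack the pieces into one long Series
s = pd.DataFrame([x.split('///') for x in df['Genes'].tolist()], index=df.index).stack()
# or you can use the approach from the comment
#s = df['Genes'].str.split('///', expand=True).stack()
# drop the inner index level so each piece keeps its original row label
s.index = s.index.droplevel(-1)
# strip the whitespace left around the separator
s = s.str.strip()
s.name = 'Genes'
print(s)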
0            Canx
1    LOC101056688
1            Wars
Name: Genes, dtype: object
# drop the original Genes column before joining; otherwise join raises:
#ValueError: columns overlap but no suffix specified: Index([u'Genes'], dtype='object')
df = df.drop('Genes', axis=1)
# join aligns on the row label, so rows with several genes are repeated
df = df.join(s).reset_index(drop=True)
print(df[["Probes","Genes","cv_filter","Organ"]])
       Probes         Genes  cv_filter Organ
0  1415693_at          Canx      0.134    LN
1  1415693_at  LOC101056688      0.290    LV
2  1415693_at          Wars      0.290    LV
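As a possible alternative that is not part of the answer above: if your pandas version is 0.25 or later, DataFrame.explode may do the same thing in fewer steps. The sketch below assumes that version and reuses the sample data from the question; the name result is only illustrative, and I have not timed it on 150K rows.

```python
import pandas as pd

df = pd.DataFrame({
    'Probes': ["1415693_at", "1415693_at"],
    'Genes': ["Canx", "LOC101056688 /// Wars "],
    'cv_filter': [0.134, 0.290],
    'Organ': ["LN", "LV"],
})

# Turn each Genes string into a list of pieces, then give every
# list element its own row (assumes pandas >= 0.25 for explode).
result = (
    df.assign(Genes=df['Genes'].str.split('///'))
      .explode('Genes')
      .reset_index(drop=True)
)

# Remove the whitespace left around the separator.
result['Genes'] = result['Genes'].str.strip()

result = result[["Probes", "Genes", "cv_filter", "Organ"]]
print(result)
```

The other columns are repeated for every gene taken from the same original row, which matches the join-based approach.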