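Note: I'm a Python beginner and started using pandas only a few days ago. I come from an R background.
I'm trying to split the columns of a pandas DataFrame, but I can only split on one delimiter at a time.
My data looks like this: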
0 1 2 3 \
0 chr4:43571332-43571643 numsnp=3 length=312 state1,cn=0
1 chr5:179618873-179628421 numsnp=8 length=9,549 state1,cn=0
4 5 6
0 CCCC.A_1_TR27GD1 startsnp=S-3TZTE endsnp=S-4NDOX
1 CCCC.A_1_TR27GD1 startsnp=S-3IDBJ endsnp=S-4AKVJ
I'd like my output to look like this:
Chromosome Start End NumSNP Length StartSNP EndSNP
0 4 43571332 43571643 3 312 S-3TZTE S-4NDOX
1 5 179618873 179628421 8 9,549 S-3IDBJ S-4AKVJ
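I know this is a lot, but it involves the following:
Question: I've been able to do this with the code below, but I'd like some guidance on more efficient code.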
import pandas as pd
CNV = pd.read_csv('CCCC_cnv_practice.rawcnv', delimiter=r"\s+", engine='python', header=None)
# Get chromosomes (recent pandas versions require n=1 to be passed by keyword)
ChrPos = pd.DataFrame(CNV[0].str.split(':', n=1).tolist(), columns=['Chromosome', 'Position'])
# Remove the literal 'chr' prefix; lstrip('chr') strips any leading 'c', 'h', or 'r' characters instead
Chromosome = ChrPos['Chromosome'].str.replace('^chr', '', regex=True)
# Get start and end positions
Positions = pd.DataFrame(ChrPos.Position.str.split('-', n=1).tolist(), columns=['Start', 'End'])
# Get the NumSNP, Length, StartSNP, and EndSNP columns
Equals1 = CNV.iloc[:,1:3]
Equals2 = CNV.iloc[:,5:]
Equals = Equals1.join(Equals2)
TEST1 = pd.DataFrame(Equals[1].str.split('=', n=1).tolist())
TEST2 = pd.DataFrame(Equals[2].str.split('=', n=1).tolist())
TEST3 = pd.DataFrame(Equals[5].str.split('=', n=1).tolist())
TEST4 = pd.DataFrame(Equals[6].str.split('=', n=1).tolist())
#Put it all together
frames = [Chromosome, Positions, TEST1[1], TEST2[1], TEST3[1], TEST4[1]]
Data = pd.concat(frames, axis=1)
Data.columns = ['Chromosome', 'Start', 'End', 'NumSNP', 'Length', 'StartSNP', 'EndSNP']
Answer 0 (score: 1)
I think you can use:
print(df)
0 1 2 3 \
0 chr4:43571332-43571643 numsnp=3 length=312 state1,cn=0
1 chr5:179618873-179628421 numsnp=8 length=9,549 state1,cn=0
4 5 6
0 CCCC.A_1_TR27GD1 startsnp=S-3TZTE endsnp=S-4NDOX
1 CCCC.A_1_TR27GD1 startsnp=S-3IDBJ endsnp=S-4AKVJ
#new empty dataframe
df1 = pd.DataFrame()
df1[['Chromosome', 'tmp']] = pd.DataFrame([ x.split(':') for x in df[0].tolist() ])
df1[['Start', 'End']] = pd.DataFrame([ x.split('-') for x in df1['tmp'].tolist() ])
#tmp is temporary column
df1[['tmp', 'NumSNP']] = pd.DataFrame([ x.split('=') for x in df[1].tolist() ])
df1[['tmp', 'Length']] = pd.DataFrame([ x.split('=') for x in df[2].tolist() ])
df1[['tmp', 'StartSNP']] = pd.DataFrame([ x.split('=') for x in df[5].tolist() ])
df1[['tmp', 'EndSNP']] = pd.DataFrame([ x.split('=') for x in df[6].tolist() ])
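# remove the literal 'chr' prefix; lstrip('chr') strips any leading 'c', 'h', or 'r' characters instead
df1['Chromosome'] = df1['Chromosome'].str.replace('^chr', '', regex=True)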
#delete tmp column
df1 = df1.drop(['tmp'], axis=1)
print(df1)
# Chromosome Start End NumSNP Length StartSNP EndSNP
#0 4 43571332 43571643 3 312 S-3TZTE S-4NDOX
#1 5 179618873 179628421 8 9,549 S-3IDBJ S-4AKVJ
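As a possible alternative, here is a minimal sketch that uses `Series.str.extract` with a regular expression so that column 0 is split on both `:` and `-` in a single pass. The sample rows are copied from the question; the names `result`, `col`, and `name` are illustrative assumptions, and every extracted value stays a string (so `9,549` keeps its comma).

```python
import pandas as pd

# Sample rows copied from the question (columns 0-6).
df = pd.DataFrame([
    ["chr4:43571332-43571643", "numsnp=3", "length=312", "state1,cn=0",
     "CCCC.A_1_TR27GD1", "startsnp=S-3TZTE", "endsnp=S-4NDOX"],
    ["chr5:179618873-179628421", "numsnp=8", "length=9,549", "state1,cn=0",
     "CCCC.A_1_TR27GD1", "startsnp=S-3IDBJ", "endsnp=S-4AKVJ"],
])

# Named groups become column names; the regex handles ':' and '-' together
# and drops the literal 'chr' prefix.
result = df[0].str.extract(
    r"^chr(?P<Chromosome>[^:]+):(?P<Start>\d+)-(?P<End>\d+)$"
)

# For each key=value column, keep only the text after the first '='.
for col, name in [(1, "NumSNP"), (2, "Length"), (5, "StartSNP"), (6, "EndSNP")]:
    result[name] = df[col].str.split("=", n=1).str[1]

print(result)
```

This avoids building intermediate DataFrames and a temporary column. Rows that do not match the pattern would come back as NaN in the first three columns, which may or may not be what you want for malformed input.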