For an input line such as:
NM_001003443 chr11 + 5925152 5926098 5925152 5926098 2 5925152,5925652, 5925404,5926098,
I want a header line like the following (unspliced, which means '-s' is in sys.argv):
>NM_00100343|chr11(+):5925152-5926098
or (spliced, with no '-s' in sys.argv):
>NM_00100343|chr11(+):5925152-5926098|5925151-5925404,5925652-5926098
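I tried to do this but still get incorrect matches. Could someone look at my regular expression and tell me whether it is sound and/or matches correctly?
I wrote: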
    p = '(NM_\d+)\s+(chr\d+)\s+([+|-])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\S+(\d+)\S+(\d+),(\d+),\S+(\d+),(\d+),'
and tried to match lines against it (every line in the file looks like the example line above; the file is opened with fp = open(infile, 'r')):
    for line in fp:
        r = search(p, line)
        if '-s' in sys.argv and r:
            wp.write('>'+r.group(1)+'|'+r.group(2)+'('+r.group(3)+')'+':'+r.group(4)+'-'+r.group(5))
        else:
            wp.write('>'+r.group(1)+'|'+r.group(2)+'('+r.group(3)+')'+':'+r.group(4)+'-'+r.group(5)+'|'+r.group(6)+'-'+r.group(11)+','+r.group(9)+'-'+r.group(12))
Edit: does this look right?
    for line in fp:
        line = line.replace(',',' ')
        tokens = line.split()
        if '-s' in sys.argv and r:
            wp.write('>'+tokens[0]+'|'+tokens[1]+'('+tokens[2]+')'+':'+tokens[3]+'-'+tokens[4])
        else:
            wp.write('>'+tokens[0]+'|'+tokens[1]+'('+tokens[2]+')'+':'+tokens[3]+'-'+tokens[4]+'|'+tokens[5]+'-'+tokens[10]+','+tokens[8]+'-'+tokens[11])
Answer 0 (score: 2)
All the data you need is delimited by whitespace or commas, so you don't need a regular expression at all.
    mystring = mystring.replace(',', ' ')  # convert all commas to spaces
    tokens = mystring.split()              # split on whitespace
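As a rough sketch, the split-based approach could be turned into the two header formats from the question as shown below. It assumes the columns are name, chromosome, strand, transcript start and end, coding start and end, exon count, then all exon starts followed by all exon ends; the question does not spell this layout out, so treat it as an assumption. The function name format_header and its spliced parameter are made up for illustration.

```python
def format_header(line, spliced=True):
    """Build a '>' header from one whitespace/comma-delimited record."""
    # Commas become spaces, so every field ends up as its own token.
    tokens = line.replace(',', ' ').split()
    name, chrom, strand, tx_start, tx_end = tokens[:5]
    header = '>%s|%s(%s):%s-%s' % (name, chrom, strand, tx_start, tx_end)
    if not spliced:
        return header
    exon_count = int(tokens[7])
    # Assumption: exon starts come first, then the same number of exon ends.
    starts = tokens[8:8 + exon_count]
    ends = tokens[8 + exon_count:8 + 2 * exon_count]
    exons = ','.join('%s-%s' % (s, e) for s, e in zip(starts, ends))
    return header + '|' + exons


line = ('NM_001003443 chr11 + 5925152 5926098 5925152 5926098 2 '
        '5925152,5925652,5925404,5926098,')
print(format_header(line, spliced=False))
print(format_header(line))
```

In the question's script the flag would come from the command line, for example format_header(line, spliced='-s' not in sys.argv). Pairing starts and ends by exon count avoids hard-coded indices such as tokens[10], so the same code handles records with more than two exons.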
If you want to stick with a regular expression, yours has a few typos. Here is the corrected pattern:
    p = r'(NM_\d+)\s+(chr\d+)\s+([+|-])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+),(\d+),\s+(\d+),(\d+),'
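The fixes:
The strand can be written as [+-], without the extra parens and | (inside a character class, | is just a literal character).
The \s+ after (chr\d+) was missing.
The \ was missing in ,s+(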
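For comparison, here is a small sketch of how a corrected pattern might be applied with re.search. It is written as a raw string so the backslashes reach the regex engine unchanged. It assumes exactly two exons and whitespace between the exon-start and exon-end columns, because that is what the pattern requires. The sample line and the variable names are illustrative only.

```python
import re

# Groups 1-8: name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount.
# Groups 9-10: the two exon starts; groups 11-12: the two exon ends.
PATTERN = re.compile(
    r'(NM_\d+)\s+(chr\d+)\s+([+-])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+'
    r'(\d+),(\d+),\s+(\d+),(\d+),'
)

# Assumed sample: note the whitespace between the start and end lists.
sample = ('NM_001003443 chr11 + 5925152 5926098 5925152 5926098 2 '
          '5925152,5925652, 5925404,5926098,')

m = PATTERN.search(sample)
if m is None:
    header = None  # the line did not match; skip it rather than crash
else:
    name, chrom, strand, tx_start, tx_end = m.group(1, 2, 3, 4, 5)
    header = '>%s|%s(%s):%s-%s|%s-%s,%s-%s' % (
        name, chrom, strand, tx_start, tx_end,
        m.group(9), m.group(11), m.group(10), m.group(12))
print(header)
```

Checking for None before calling group() matters because re.search returns None for any line that does not match. The pattern is also tied to two exons and to numeric chromosome names (chr\d+ will not match chrX), which is one reason the split-based approach is easier to generalize.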