I have a file containing a series of strings in the following format:
>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
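I am trying to find a regular expression pattern that removes the newline characters from the lines below a > character, up to the next > character. So the end result would look like this: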
>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
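Does anyone know how I could come up with a regex to do this?
Side note: this format is common in computational science, where it is known as the FASTA format.
Thanks!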
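Answer 0 (score: 1)
As mentioned in the comments, your best bet is to use an existing FASTA parser. Why not?
Here is a way to join up the lines based on the leading greater-than sign:
def joinup(f):
    buf = []
    for line in f:
        if line.startswith('>'):
            # A new header: flush the lines collected for the previous record.
            if buf:
                yield " ".join(buf)
            yield line.rstrip()
            buf = []
        else:
            buf.append(line.rstrip())
    # Flush the last record, if there is one.
    if buf:
        yield " ".join(buf)

with open("...") as f:
    for joined_line in joinup(f):
        print(joined_line)  # blah blah...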
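Answer 1 (score: 0)
Given that > should always be the first character on a new line, replace
"\n([^>])" with " \1" (a space followed by the captured character, so the joined lines stay separated by a space).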
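One caveat with a plain "\n([^>])" substitution is that it also consumes the newline after each header, so the header ends up on the same line as its information. Below is a minimal sketch of a variant that leaves headers alone, using Python's re module. The pattern and the names join_records and sample are assumptions made for illustration, not part of the answer above, and the sketch assumes "\n" line endings.

```python
import re

# Illustrative pattern (an assumption, not the answer's original regex).
# With re.MULTILINE, ^ matches at the start of every line. A line that does
# not begin with '>' has its trailing newline turned into a space, but only
# when the next line is also a non-empty, non-header line.
_JOIN = re.compile(r"^([^>\n].*)\n(?=[^>\n])", re.MULTILINE)


def join_records(text):
    """Join consecutive non-header lines with single spaces."""
    return _JOIN.sub(r"\1 ", text)


sample = (
    ">HEADER_Text1\nline a\nline b\nline c\n"
    ">HEADER_Text2\nline d\nline e\n"
)

# The plain substitution, for comparison: headers get merged as well.
naive = re.sub(r"\n([^>])", r" \1", sample)

print(join_records(sample))
```

The lookahead keeps the newline before each > and at the end of the file, so every record stays on exactly two lines.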
Answer 2 (score: 0)
You don't have to use a regex:
[ x.startswith('>') and x or x.replace('\n','') for x in f.readlines()]
should work.
In [43]: f=open('test.txt')
In [44]: contents=[ x.startswith('>') and x or x.replace('\n','') for x in f.readlines()]
In [45]: contents
Out[45]:
['>HEADER_Text1\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada',
'>HEADER_Text2\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada',
'>HEADER_Text3\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada']
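The list above has the newlines stripped from the information lines, but the pieces are not yet joined with spaces into one line per record. One way to finish the job, sketched here with itertools.groupby, is to group consecutive non-header lines and join each run. The function name join_non_headers and the shortened sample data are assumptions for illustration.

```python
from itertools import groupby


def join_non_headers(lines):
    """Yield header lines unchanged and each run of other lines as one line."""
    stripped = (line.rstrip("\n") for line in lines)
    # groupby splits the stream into consecutive runs of header and
    # non-header lines, keyed on whether the line starts with '>'.
    for is_header, run in groupby(stripped, key=lambda s: s.startswith(">")):
        if is_header:
            yield from run
        else:
            yield " ".join(run)


sample = [
    ">HEADER_Text1\n", "Information here\n", "Some more\n",
    ">HEADER_Text2\n", "Information here\n", "Even more\n",
]
print("\n".join(join_non_headers(sample)))
```

Because the function only iterates over its argument, an open file object can be passed in place of the list.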
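Answer 3 (score: 0)
You really don't want a regex, and Python and Biopython are overkill for this job. If this is actually FASTQ format, just use sed (the command relies on GNU sed's combined 2g flag and assumes exactly three lines follow each header):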
sed '/^>/ { N; N; N; s/\n/ /2g }' file
Result:
>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada