Formatting a document using regular expressions

Time: 2015-03-07 15:59:13

Tags: python regex

I have a piece of text like this -

Authority: soaplab.icapture.ubc.ca - EMBOSS seqret program. The sequence source is USA (Uniform Sequence Address). This means that you pass in a database name as the namespace.and an entry from the db as the id, e.g.db = embl and id = X13776. The input format is swiss.The output format is fasta.

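Using regular expressions, I have to format the text correctly - put a space after a full stop wherever the length of the sentence (between two full stops) is greater than 7. Examples in the given text are - e.g.db ... and swiss.The output format is ...

I use the following regular expression to match such full stops -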

\.[^\.]{7,}\.

For example,

Input - The input format is swiss.The output format is fasta.
Output - The input format is swiss. The output format is fasta.

Input - from the db as the id, e.g.db = embl and id = X13776.
Output - from the db as the id, e.g. db = embl and id = X13776.

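However, this matches the entire sentence of length 7 or more, not just the full stop. How do I match only the full stops in those two cases?

1 answer:

Answer 0: (score: 0)

You can use a capturing group or a positive-lookahead-based regex in the re.sub function.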

>>> import re
>>> s = '''The input format is swiss.The output format is fasta.
from the db as the id, e.g.db = embl and id = X13776.'''
>>> print(re.sub(r'\.([^.]{7,}\.)', r'. \1', s))
The input format is swiss. The output format is fasta.
from the db as the id, e.g. db = embl and id = X13776.

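[^.]{7,} matches any character except a dot, 7 or more times. So there must be at least 7 characters between the two dots. The sentence and its closing dot are captured in group 1 and written back after the inserted space.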


>>> print(re.sub(r'\.(?=[^.]{7,}\.)', r'. ', s))
The input format is swiss. The output format is fasta. 
from the db as the id, e.g. db = embl and id = X13776.

\.(?=[^.]{7,}\.) matches a dot only if it is followed by a sentence of at least 7 characters. If so, the matched dot is replaced with dot + space.
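
One practical difference between the two patterns above may be worth checking. The capturing-group version consumes the closing dot of each match, so that dot cannot start the next match, while the lookahead version leaves it in place. The sketch below uses a made-up string (`text` is purely illustrative and not from the question) with three consecutive sentences that lack spaces.

```python
import re

# Hypothetical input: three 8-character "sentences" with no space after the dots.
text = 'aaaaaaaa.bbbbbbbb.cccccccc.'

# Capturing group: the closing dot is part of the match, so the dot
# before "cccccccc" has already been consumed and is never tested.
with_group = re.sub(r'\.([^.]{7,}\.)', r'. \1', text)

# Lookahead: only the leading dot is consumed, so every dot is tested.
with_lookahead = re.sub(r'\.(?=[^.]{7,}\.)', '. ', text)

print(with_group)      # aaaaaaaa. bbbbbbbb.cccccccc.
print(with_lookahead)  # aaaaaaaa. bbbbbbbb. cccccccc.
```

On the two-sentence sample from the question the visible results look the same, so this only matters when several unspaced sentences follow one another.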


>>> print(re.sub(r'(?<=\.)(?=[^.]{7,}\.)', r' ', s))
The input format is swiss. The output format is fasta. 
from the db as the id, e.g. db = embl and id = X13776.
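
The last pattern matches the empty position between a dot and a qualifying sentence, so only a space is inserted. Because [^.] also matches a newline, both lookaround versions add a trailing space after "fasta." on the sample input, which is easy to miss in printed output. The sketch below makes that visible with repr and, as an assumption about the intended behavior rather than part of the answer, tries a variant that excludes \n from the character class (the name `single_line` is illustrative).

```python
import re

s = ('The input format is swiss.The output format is fasta.\n'
     'from the db as the id, e.g.db = embl and id = X13776.')

lookahead = re.sub(r'\.(?=[^.]{7,}\.)', '. ', s)
lookbehind = re.sub(r'(?<=\.)(?=[^.]{7,}\.)', ' ', s)

# Both lookaround patterns produce the same string on this input.
print(lookahead == lookbehind)  # True

# The dot after "fasta" is followed by "\nfrom the db as the id, e.",
# so a space is inserted before the newline.
print(repr(lookahead.splitlines()[0]))
# 'The input format is swiss. The output format is fasta. '

# Assumed variant (not from the answer): a sentence may not span lines.
single_line = re.sub(r'\.(?=[^.\n]{7,}\.)', '. ', s)
print(repr(single_line.splitlines()[0]))
# 'The input format is swiss. The output format is fasta.'
```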