Question

I am working with a file that is a GenBank entry (similar to this one).

My goal is to extract the numbers from the CDS line, for example:

    CDS             join(1200..1401,3490..4302)

But my regex should also be able to extract the numbers when the location spans multiple lines, like this:

     CDS            join(1200..1401,1550..1613,1900..2010,2200..2250,
                 2300..2660,2800..2999,3100..3333)

I am using this regex:

     import re
     match = re.compile(r'\w+\D+\W*(\d+)\D*')
     result = match.findall(line)
     print(result)

This gives me the correct numbers, but it also returns numbers from the rest of the file, such as

     gene            complement(3300..4037)

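So how can I change my regex to get only the CDS numbers? I am supposed to use only regex.

I will use these numbers to print the coding part of the base sequence.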

Answer 1

You could use Matthew Barnett's much-improved regex module, which provides \G. With it, you could come up with the following code:

import regex as re
rx = re.compile(r"""
            (?:
                CDS\s+join\(    # look for CDS, followed by whitespace and join(
                |               # OR
                (?!\A)\G        # continue where the previous match ended (\G), but not at the start of the string
                [.,\s]+         # followed by dots, commas, or whitespace
            )
            (\d+)               # capture these digits
                """, re.VERBOSE)

string = """
         CDS            join(1200..1401,1550..1613,1900..2010,2200..2250,
                     2300..2660,2800..2999,3100..3333)
"""

numbers = rx.findall(string)
print(numbers)
# ['1200', '1401', '1550', '1613', '1900', '2010', '2200', '2250', '2300', '2660', '2800', '2999', '3100', '3333']

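\G makes the regex engine look for the next match exactly where the previous match ended. See a demo on regex101.com (in PHP mode, because the emulator's Python flavor uses the original re module and does not provide this feature).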
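If only the standard re module is available, the "continue where the last match ended" behaviour can be approximated by hand. The following is a minimal sketch, not part of the original answer: it relies on the fact that a compiled pattern's match(text, pos) only succeeds when the match begins exactly at pos. The names START, NEXT_NUMBER and cds_numbers are made up for illustration, and like the pattern above it assumes every CDS location is written as join(...).

```python
import re

# Hypothetical helper names; assumes CDS locations are written as join(...).
START = re.compile(r"CDS\s+join\(")          # where a CDS location begins
NEXT_NUMBER = re.compile(r"[.,\s]*(\d+)")    # optional separators, then one number


def cds_numbers(text):
    """Collect the numbers that directly follow 'CDS join('."""
    numbers = []
    for start in START.finditer(text):
        pos = start.end()
        while True:
            # match(text, pos) must begin at pos, which mimics the \G anchor
            m = NEXT_NUMBER.match(text, pos)
            if m is None:
                break  # hit ')' or anything else that is not part of the list
            numbers.append(m.group(1))
            pos = m.end()
    return numbers


sample = (
    "     gene            complement(3300..4037)\n"
    "     CDS            join(1200..1401,1550..1613,\n"
    "                 1900..2010)\n"
)
print(cds_numbers(sample))
# ['1200', '1401', '1550', '1613', '1900', '2010']
```

The gene line is skipped because scanning only starts after a CDS join( prefix and stops at the first character that is neither a separator nor a digit.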

A far inferior solution (if you are only allowed to use the re module) would be to use lookarounds:

(?<=[(.,\s])(\d+)(?=[,.)])

(?<=...) is a positive lookbehind and (?=...) is a positive lookahead; see a demo for this approach on regex101.com. Note that this may produce some false positives, though.
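To make the false-positive caveat concrete, here is a small check of the lookaround pattern with the standard re module. The sample text is assembled from the lines quoted in the question; the variable names are only illustrative.

```python
import re

# The lookaround pattern from above, compiled with the standard re module.
LOOKAROUND = re.compile(r"(?<=[(.,\s])(\d+)(?=[,.)])")

sample = (
    "     gene            complement(3300..4037)\n"
    "     CDS             join(1200..1401,3490..4302)\n"
)

matches = LOOKAROUND.findall(sample)
print(matches)
# ['3300', '4037', '1200', '1401', '3490', '4302']
```

The pattern has no notion of the CDS keyword, so the numbers from the gene line (3300 and 4037) are returned as well.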

Answer 2

The following re pattern might work:

>>> match = re.compile(r'\s+CDS\s+\w+\([^\)]*\)')

But you will need to call findall on the whole body of text, not on one line at a time.

You can add a capturing group (parentheses) to get just the numbers:

>>> match = re.compile(r'\s+CDS\s+\w+\(([^\)]*)\)')
>>> match.findall(stuff)
['1200..1401,3490..4302']       # Numbers only
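The captured group is still one string per CDS entry, so a second pass is needed to split it into individual numbers. A minimal sketch of that two-step idea, using only re, might look like this; cds_ranges and CDS_LOCATION are hypothetical names, and the pattern assumes (as above) that CDS is preceded by whitespace.

```python
import re

# Step 1: capture everything between the parentheses of each CDS location.
CDS_LOCATION = re.compile(r"\s+CDS\s+\w+\(([^\)]*)\)")


def cds_ranges(text):
    """Return one list of (start, end) pairs per CDS entry found in text."""
    ranges = []
    for location in CDS_LOCATION.findall(text):
        # Step 2: pull the digits out of the captured location string.
        digits = [int(n) for n in re.findall(r"\d+", location)]
        # Pair consecutive numbers: 1200..1401 becomes (1200, 1401).
        ranges.append(list(zip(digits[0::2], digits[1::2])))
    return ranges


sample = (
    "     gene            complement(3300..4037)\n"
    "     CDS            join(1200..1401,1550..1613,\n"
    "                 1900..2010)\n"
)
print(cds_ranges(sample))
# [[(1200, 1401), (1550, 1613), (1900, 2010)]]
```

Because [^\)]* also matches newlines, a location that wraps onto a second line is captured in one piece.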

Let me know if this does what you want!

Python: regex to get a repeating set of numbers
