Question

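I'm working on a Python script and have run into something I can't figure out. In this part I open a file and first locate the lines that start with >. However, I want to skip the lines that match any of the following regex patterns: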

uce.+$
ENSOFAS.+$
_[AB]_[0-9]+$
_[AB]_[0-9]+_rc$

If my code looks like this, handling only one of the patterns, it works:

import re

with open(company_fn, "r") as company_fh:
    for line in company_fh:
        # Print header lines that do not match the exclusion pattern
        if line.startswith('>') and not re.search('uce.+$', line.strip()):
            print(line)

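But I need to account for all the other possibilities. I have tried not re.search(('uce.+$ | ENSOFAS.+$'), line.strip()):, not re.search(('uce.+$' | 'ENSOFAS.+$'), line.strip()): and other variations, without success. How can I make re.search consider all four regular expressions?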

Answer 1

This is wrong:

not re.search(('uce.+$ | ENSOFAS.+$'), line.strip())

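When combining regular expressions with |, don't add spaces "for clarity", because they are treated as literal characters in the pattern. This works: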

not re.search('uce.+$|ENSOFAS.+$',line.strip())
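To see why the spaces matter, here is a minimal sketch comparing the spaced pattern, the tight pattern, and the spaced pattern compiled with re.VERBOSE (which tells the standard re module to ignore unescaped whitespace). The header strings are invented for illustration and are not from the asker's file. With literal spaces, the first alternative requires a space after the end of the string, which cannot match, and the second requires a leading space. The other attempt, 'uce.+$' | 'ENSOFAS.+$', fails earlier with a TypeError, since | is not defined between two Python strings.

```python
import re

# Invented sample headers, for illustration only
headers = [">uce-1234_p1", ">ENSOFAS000123", ">keep_me"]

spaced = re.compile(r"uce.+$ | ENSOFAS.+$")                # spaces are literal characters
tight = re.compile(r"uce.+$|ENSOFAS.+$")                   # the working form from this answer
verbose = re.compile(r"uce.+$ | ENSOFAS.+$", re.VERBOSE)   # whitespace in the pattern is ignored

# Keep only the headers that the pattern does NOT match
kept_spaced = [h for h in headers if not spaced.search(h)]
kept_tight = [h for h in headers if not tight.search(h)]
kept_verbose = [h for h in headers if not verbose.search(h)]

print(kept_spaced)   # nothing is filtered out: the spaced pattern never matches
print(kept_tight)    # only the header matching neither alternative remains
print(kept_verbose)  # same result as the tight pattern
```

If readable spacing is important, re.VERBOSE is one option; otherwise the tight form is the simplest fix.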

Answer 2

If you are able to use the newer regex module, you can define the exceptions like this:

import regex as re

string = """
uce123
ENSOFAS123
_A_123
_B_123_rc
this line should be matched
"""

exceptions = [r'uce.+$', r'ENSOFAS.+$', r'_[AB]_[0-9]+$', r'_[AB]_[0-9]+_rc$']

rx = re.compile(r'(?:{})(*SKIP)(*FAIL)|(.+)'.format("|".join(exceptions)), re.MULTILINE)

lines = rx.findall(string)
print(lines)
# ['this line should be matched']

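Basically, this sets up a list, exceptions, whose entries are joined with | into the overall expression. The (*SKIP)(*FAIL) verbs discard any text matched by one of the exceptions, so only the remaining lines are captured by (.+).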
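If installing the third-party regex module is not an option, the same join-the-list idea should also work with the standard re module, combined with the asker's original startswith('>') check. The following is a minimal sketch under that assumption; the helper name kept_headers and the sample headers are invented for illustration, and io.StringIO stands in for the real file handle.

```python
import io
import re

# The four exclusion patterns from the question
exceptions = [r"uce.+$", r"ENSOFAS.+$", r"_[AB]_[0-9]+$", r"_[AB]_[0-9]+_rc$"]

# "|" has the lowest precedence, so each pattern stays intact as one alternative
skip_rx = re.compile("|".join(exceptions))

def kept_headers(handle):
    """Yield '>' header lines that match none of the exclusion patterns."""
    for line in handle:
        line = line.strip()
        if line.startswith(">") and not skip_rx.search(line):
            yield line

# Invented sample data standing in for the real file
sample = io.StringIO(
    ">uce-12_x\n"
    "ACGT\n"
    ">ENSOFAS0001\n"
    ">probe_A_17\n"
    ">probe_B_3_rc\n"
    ">keep_this_one\n"
)

result = list(kept_headers(sample))
print(result)
```

Compiling the pattern once outside the loop avoids rebuilding it for every line, and adding a fifth exclusion only requires appending to the list.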

"not re.search" with multiple conditions
