Question

Suppose I have a file that looks like this:

'''
MFTF2LH_LSetC1_D-10_hot50_fa00_bpmax
MFTF2LH_LSetC1_D-11_hot50_fa00_bpmax
MFTF2LH_LSetC1_D-01_hot56_fa00_bpmax
MFTF2LH_LSetC1_D-02_hot56_fa00_bpmax
MFTF2LH_LSetC1_D-03_hot56_fa00_bpmax
MFTF2LH_LSetC1_D-04_hot50_fa00_bpmax
MFTF2LH_LSetC1_D-07_hot43_fa00_bpmax
MFTF2LH_LSetC1_D-10_hot56_fa00_bpmax
'''

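but with millions of lines.

What I want to do is read it line by line and ignore the lines that have either of the following features:

_D-XX_hotYY, where XX is in [01,07] (inclusive) and YY = 43 or 50
_D-XX_hot56, where XX is in [08,11] (inclusive)

So, for the sample lines above, only the last three lines would be ignored.

I'm using this regex pattern to do the trick (test here):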

pattern = '(_D-0[1-7]_hot(43|50)|_D-0[8,9]_hot56|_D-1[0,1]_hot56)'

But I'm wondering whether there is a better way to do it, since I only want a boolean result; no groups or anything.

I'm a regex beginner, by the way.

Answer 1

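You can improve your pattern by factoring out the common prefix, so that the alternatives only branch at the positions in the string where they actually differ.

Use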

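import re

# Non-capturing groups only: nothing is stored, the match is just a yes/no test
rx = re.compile(r'_D-(?:1[01]_hot56|0(?:[89]_hot56|[1-7]_hot(?:43|50)))')
# .... Read the file line by line ...
if not rx.search(line):
    pass  # OK, process the line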

See the regex demo.
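To check the pattern against the sample from the question, something like the following sketch could be used. The helper name `should_skip` and the `SAMPLE` list are only illustrative; `re.search` returns a match object or `None`, so comparing against `None` gives the plain boolean the question asks for.

```python
import re

# Pattern from the answer; it contains only non-capturing groups.
SKIP_RX = re.compile(r'_D-(?:1[01]_hot56|0(?:[89]_hot56|[1-7]_hot(?:43|50)))')


def should_skip(line: str) -> bool:
    """Return True when the line matches one of the two ignore rules."""
    return SKIP_RX.search(line) is not None


# Sample lines copied from the question.
SAMPLE = [
    "MFTF2LH_LSetC1_D-10_hot50_fa00_bpmax",
    "MFTF2LH_LSetC1_D-11_hot50_fa00_bpmax",
    "MFTF2LH_LSetC1_D-01_hot56_fa00_bpmax",
    "MFTF2LH_LSetC1_D-02_hot56_fa00_bpmax",
    "MFTF2LH_LSetC1_D-03_hot56_fa00_bpmax",
    "MFTF2LH_LSetC1_D-04_hot50_fa00_bpmax",
    "MFTF2LH_LSetC1_D-07_hot43_fa00_bpmax",
    "MFTF2LH_LSetC1_D-10_hot56_fa00_bpmax",
]

# Keep everything that is not matched; the last three sample lines drop out.
kept = [line for line in SAMPLE if not should_skip(line)]
for line in kept:
    print(line)
```

Compiling the pattern once and reusing it keeps the per-line cost low when the loop runs over millions of lines.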

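Pattern details:

_D- - a literal substring
(?: - start of a non-capturing group (unlike a capturing group, no memory buffer is created for it) that matches:
- 1[01]_hot56 - a 1, then a 0 or 1, then _hot56
- | - or
- 0 - a 0 character, then
- (?: - a second non-capturing group
  - [89]_hot56 - an 8 or 9, then _hot56
  - | - or
  - [1-7]_hot(?:43|50) - a digit from 1 to 7, then _hot, then 43 or 50
- ) - end of the second non-capturing group
) - end of the first non-capturing group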
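The breakdown above can also be written into the pattern itself with `re.VERBOSE`. The sketch below does that, then brute-force checks both spellings against a plain-Python version of the rule. The `rule` function is my own reading of the question, and the file-name template around `XX`/`YY` is taken from the sample lines, so treat both as assumptions.

```python
import re

COMPACT = re.compile(r'_D-(?:1[01]_hot56|0(?:[89]_hot56|[1-7]_hot(?:43|50)))')

# Same pattern with the explanation inline; whitespace and # comments are ignored.
VERBOSE = re.compile(r"""
    _D-                         # literal substring
    (?:                         # first non-capturing group
        1[01]_hot56             #   10 or 11, then _hot56
      |                         #   or
        0                       #   a 0, then
        (?:                     #   second non-capturing group
            [89]_hot56          #     8 or 9, then _hot56
          |                     #     or
            [1-7]_hot(?:43|50)  #     1 to 7, then _hot, then 43 or 50
        )
    )
""", re.VERBOSE)


def rule(xx: int, yy: int) -> bool:
    """Assumed reading of the question's two ignore rules."""
    return (1 <= xx <= 7 and yy in (43, 50)) or (8 <= xx <= 11 and yy == 56)


# Try every two-digit XX/YY combination in the sample's naming scheme.
mismatches = []
for xx in range(100):
    for yy in range(100):
        name = f"MFTF2LH_LSetC1_D-{xx:02d}_hot{yy:02d}_fa00_bpmax"
        expected = rule(xx, yy)
        if bool(COMPACT.search(name)) != expected or bool(VERBOSE.search(name)) != expected:
            mismatches.append(name)

print(len(mismatches))
```

An empty `mismatches` list means both spellings agree with the rule for every two-digit combination.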

Answer 2

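I would use grep with -v (invert match):

grep -Ev "D-0[1-7]_hot(43|50)|D-(08|09|10|11)_hot56" raw.txt > filtered.txt

The pattern matches exactly the lines you don't want, and -v then inverts the match so only the other lines are written out. Note that grep -E uses POSIX extended regular expressions, which have no (?:...) non-capturing syntax, so plain parentheses are used for grouping.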
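If you would rather stay in Python, a rough equivalent of the `grep -v` filter might look like the sketch below. The function name `filter_file` is hypothetical and UTF-8 input is assumed. It streams the file line by line, so a file with millions of lines is never held in memory at once.

```python
import re

# Same alternation as the grep command above (plain groups work in Python too).
DROP_RX = re.compile(r'D-0[1-7]_hot(43|50)|D-(08|09|10|11)_hot56')


def filter_file(src_path: str, dst_path: str) -> int:
    """Copy the lines that do NOT match DROP_RX; return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # iterating the file object reads one line at a time
            if not DROP_RX.search(line):
                dst.write(line)
                kept += 1
    return kept
```

Calling `filter_file("raw.txt", "filtered.txt")` would then mirror the shell command, with the file names taken from that command.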
