Question

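I'm currently writing a Python script that loads a docx file and then parses it line by line. The document has two lines like this: Base Premium and Discounted Base Premium. I was able to write a regex -> re.search(r"^Discounted") to get the string "Discounted Base Premium", but I'm having trouble writing one that pulls "Base Premium".

I did try writing re.search(r"Base Premium"), but, as expected, both strings match. I know there is a concept of negative lookahead and lookbehind, but I haven't fully understood it.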

Answer 1

You can indeed use a negative lookbehind, like this:

(?<!Discounted )Base Premium

This means: match the string "Base Premium", but only when it is not immediately preceded by the string "Discounted ".
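
As a quick check, the lookbehind can be run against the two lines from the question. This is only a minimal sketch: the `lines` list and the `pattern` name are made up for illustration, and the sample strings stand in for whatever is read from the docx file.

```python
import re

# Hypothetical sample lines standing in for the parsed docx content
lines = ["Base Premium", "Discounted Base Premium"]

# Match "Base Premium" only when "Discounted " is not immediately before it
pattern = re.compile(r"(?<!Discounted )Base Premium")

matches = [line for line in lines if pattern.search(line)]
print(matches)  # ['Base Premium']
```

Note that Python's re module requires a lookbehind to be fixed width, so this only works while the text right before "Base Premium" is a known, fixed string.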

Demo on Regex 101

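Considering your comment: if there may be other text (such as a product description) between "Discounted" and "Base Premium", and assuming you don't know in advance what it is, you first have to find it by searching for the whole "Discounted ... Base Premium" string. Then you can look for the non-discounted part with a negative lookbehind, as above.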

>>> import re
>>> # Adjacent string literals are concatenated, so text contains no newlines
>>> text = ('Thx for the response :) So if I have my test string "Water '
...         'Non-Weather Base Premium" and put in the expression it matches, '
...         'but if I have another string "Discounted Water Non-Weather Base '
...         'Premium" then it also matches which is what I am trying to avoid')
>>> # Step 1: capture whatever sits between "Discounted" and "Base Premium"
>>> dbp = re.search(r"(Discounted (.+? Base Premium))", text)
>>> dbp.groups()
('Discounted Water Non-Weather Base Premium', 'Water Non-Weather Base Premium')
>>> dbp.span()
(153, 194)
>>> # Step 2: find the same text where "Discounted " does not precede it
>>> bp = re.search(r"(?<!Discounted )" + re.escape(dbp.group(2)), text)
>>> bp.span()
(53, 83)

Answer 2

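One option is to use a negative lookahead (?! to reject lines where Base Premium occurs after Discounted, and to use word boundaries \b to prevent the words from being part of a larger word.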

^(?!.*\bDiscounted\b.*\bBase Premium\b).*\bBase Premium\b.*$

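In parts

^ Start of string
(?! Negative lookahead, assert that what is on the right is not
- .*\bDiscounted\b.*\bBase Premium\b Match Discounted followed later by Base Premium
) Close lookahead
.* Match any character except a newline 0+ times
\bBase Premium\b Match Base Premium between word boundaries
.* Match any character except a newline 0+ times
$ End of string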
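
To see the pattern in action, here is a small sketch that applies it line by line. The sample lines and the variable names are assumptions for illustration; the product text "Water Non-Weather" is borrowed from the comment quoted in Answer 1.

```python
import re

# Pattern from this answer: reject any line where "Discounted" appears
# before "Base Premium", then require "Base Premium" as whole words
pattern = re.compile(
    r"^(?!.*\bDiscounted\b.*\bBase Premium\b).*\bBase Premium\b.*$"
)

# Hypothetical sample lines
lines = [
    "Base Premium",
    "Discounted Base Premium",
    "Water Non-Weather Base Premium",
    "Discounted Water Non-Weather Base Premium",
]

kept = [line for line in lines if pattern.search(line)]
print(kept)  # ['Base Premium', 'Water Non-Weather Base Premium']
```

Unlike a lookbehind, a lookahead has no fixed-width restriction in Python's re module, so this handles arbitrary text between the two phrases in a single pass.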

Regex demo

Trying to differentiate between two slightly different strings in regex
