Question

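I have a regex that captures all the characters inside quotes in a text file, but I want to:

match only the first occurrence of this pattern
exclude certain words from the match

This is what I have so far: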

((?:\\.|[^"\\])*)

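It matches all the text inside quotes, like this:

"this is some text with the word printed in it?"

However, I want the pattern to match only the first occurrence, so I guess I need a {1} somewhere.

Then I want to exclude certain words, and I have this: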

^(?!.*word1|word2|word3)

But I'm not familiar enough with regular expressions to put the two together.

Answer 1

I think you can use this regex to match the first double-quoted string that does not contain any word from the list:

^.*?(?!"[^"]*?\b(?:word1|word2|word3)\b[^"]*?")"([^"]+?)"(?=(?:(?:[^"]*"[^"]*){2})*[^"]*$)

See the demo.

Sample code:

import re

# Capture the first double-quoted string that contains none of the excluded words.
p = re.compile(r'^.*?(?!"[^"]*?\b(?:word1|word2|word3)\b[^"]*?")"([^"]+?)"(?=(?:(?:[^"]*"[^"]*){2})*[^"]*$)')
test_str = '"word that is not matched word1" "word2 word1 word3" "this is some text word4 with the word printed in it?"'
print(p.search(test_str).group(1))

Output:

this is some text word4 with the word printed in it?

As for maintainability, the excluded words can be pulled from any source and the regex can be built dynamically.
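Building the regex dynamically could look roughly like the sketch below. The names build_pattern and first_allowed_quote are made up for illustration, and the sketch assumes the word list is non-empty (an empty alternation would make the lookahead reject almost every string). Each word goes through re.escape so that any regex metacharacters in it are treated literally.

```python
import re

def build_pattern(excluded_words):
    # Hypothetical helper: splice the excluded words into the regex above.
    if not excluded_words:
        raise ValueError("need at least one excluded word")
    # re.escape keeps characters such as '.' or '+' in a word literal.
    alternation = "|".join(re.escape(w) for w in excluded_words)
    return re.compile(
        r'^.*?(?!"[^"]*?\b(?:' + alternation + r')\b[^"]*?")'
        r'"([^"]+?)"'
        r'(?=(?:(?:[^"]*"[^"]*){2})*[^"]*$)'
    )

def first_allowed_quote(text, excluded_words):
    # Return the first quoted string without an excluded word, or None.
    match = build_pattern(excluded_words).search(text)
    return match.group(1) if match else None

sample = ('"word that is not matched word1" "word2 word1 word3" '
          '"this is some text word4 with the word printed in it?"')
print(first_allowed_quote(sample, ["word1", "word2", "word3"]))
```

The word list could just as well come from a file or a config entry; only the alternation part of the pattern changes.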

Answer 2

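Does it have to be a single regex that handles all of these requirements at once? Your code will probably be easier to maintain if you use a simple regex to find the quoted strings, then filter out every match that contains a word from a blacklist of excluded words, and finally take the first match that remains.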

import re

excluded = ('excluded', 'forbidden')
text = 'So, "this string contains an excluded word". "This second string is thus the one we want to find!" another qu"oted st"ring ... and another "quoted string with a forbidden word"'

# Find every quoted string, then drop those that contain an excluded word.
quoted_strings = re.findall('".*?"', text)
allowed_quoted_strings = [q for q in quoted_strings if not any(e in q for e in excluded)]
# Take the first string that survives the filter.
wanted_string = allowed_quoted_strings[0]

Or, if you prefer one giant single expression:

import re
wanted_string = [q for q in re.findall('".*?"', 'So, "this string contains an excluded word". "This second string is thus the one we want to find!" another qu"oted st"ring ... and another "quoted string with a forbidden word"') if not any(e in q for e in ('excluded', 'forbidden'))][0]
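Indexing with [0] raises an IndexError when every quoted string contains an excluded word. A possible variant, sketched here with a made-up helper name (first_allowed_string), scans lazily with re.finditer and returns None when nothing qualifies. It keeps the same plain substring test, so "forbidden" would also reject "unforbidden".

```python
import re

def first_allowed_string(text, excluded):
    # Hypothetical helper: yield quoted strings one at a time, left to right.
    quoted = (m.group(0) for m in re.finditer(r'".*?"', text))
    # next() stops at the first string with no excluded substring;
    # the second argument is the fallback when nothing qualifies.
    return next((q for q in quoted if not any(e in q for e in excluded)), None)

text = ('So, "this string contains an excluded word". '
        '"This second string is thus the one we want to find!" '
        'and another "quoted string with a forbidden word"')
print(first_allowed_string(text, ('excluded', 'forbidden')))
print(first_allowed_string('"only forbidden here"', ('excluded', 'forbidden')))
```

As in the code above, the result still includes the surrounding quotes, because the whole match is returned rather than a capture group.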

Regex to match the first occurrence of a string between quotes, but exclude certain words?
