Question

I have the following code that I want to optimize:

if re.search(str(stringA), line) and re.search(str(stringB), line):
    .....
    .....

I tried:

stringAB = stringA + '.*' + stringB
if re.search(str(stringAB), line):
    .....
    .....

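However, the results I get are not reliable. I am using re.search here because it seems to be the only way to search for the regular-expression patterns specified in stringA and stringB.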

The logic behind this code is modeled on the following egrep command example:

stringA=Success
stringB=mysqlDB01

egrep "${stringA}" /var/app/mydata | egrep "${stringB}"

If there is a better way to do this without re.search, please let me know.

Answer 1

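One way to do this is to make a pattern that matches either word (using \b so that only complete words match), use re.findall to collect every match in the string, and then use set equality to check that both words were found.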

import re

stringA = "spam"
stringB = "egg"

words = {stringA, stringB}

# Make a pattern that matches either word
pat = re.compile(r"\b{}\b|\b{}\b".format(stringA, stringB))

data = [
    "this string has spam in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.findall(s)
    print(repr(s), found, set(found) == words)

Output

'this string has spam in it' ['spam'] False
'this string has egg in it' ['egg'] False
'this string has egg in it and another egg too' ['egg', 'egg'] False
'this string has both egg and spam in it' ['egg', 'spam'] True
"the word spams shouldn't match" [] False
"and eggs shouldn't match, either" [] False

A slightly more efficient alternative to set(found) == words is words.issubset(found), since it skips the explicit conversion of found to a set.
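
As a rough sketch of that check (the helper name has_all_words is made up for illustration), the list returned by findall can be passed to issubset directly. The subset test and the equality test agree here only because the pattern cannot return anything other than the words in the set.

```python
import re

stringA = "spam"
stringB = "egg"
words = {stringA, stringB}

# Same either-word pattern as in the example above
pat = re.compile(r"\b{}\b|\b{}\b".format(stringA, stringB))

def has_all_words(line):
    # issubset accepts any iterable, so the findall list is passed as-is
    return words.issubset(pat.findall(line))

print(has_all_words("this string has both egg and spam in it"))        # True
print(has_all_words("this string has egg in it and another egg too"))  # False
```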

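As Jon Clements mentions in the comments, we can simplify and generalize the pattern to handle any number of words, and we should use re.escape in case any of the words contain regex metacharacters.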

pat = re.compile(r"\b({})\b".format("|".join(re.escape(word) for word in words)))

Thanks, Jon!
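
Putting the generalized pattern together with the subset check might look like the sketch below. The function names make_any_word_pattern and contains_all are illustrative, not part of the re module, and the third word "a.b" is only there to show the effect of re.escape.

```python
import re

def make_any_word_pattern(words):
    # Escape each word so metacharacters such as "." are matched literally
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b({})\b".format(alternation))

def contains_all(words, line):
    pat = make_any_word_pattern(words)
    # With one capturing group, findall returns the text of that group
    return set(words).issubset(pat.findall(line))

words = ["spam", "egg", "a.b"]
print(contains_all(words, "spam and egg and a.b"))  # True
print(contains_all(words, "spam and egg and axb"))  # False
```

Note that \b only behaves as a whole-word anchor when each word starts and ends with a word character; a word such as "c++" would need a different boundary test.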

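Here is a version that matches the words in the specified order. If a match is found it prints the matching substring; otherwise it prints None.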

import re

stringA = "spam"
stringB = "egg"
words = [stringA, stringB]

# Make a pattern that matches all the words, in order
pat = r"\b.*?\b".join([re.escape(word) for word in words])
pat = re.compile(r"\b" + pat + r"\b")

data = [
    "this string has spam and also egg, in the proper order",
    "this string has spam in it",
    "this string has spamegg in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.search(s)
    if found:
        found = found.group()
    print('{!r}: {!r}'.format(s, found))

Output

'this string has spam and also egg, in the proper order': 'spam and also egg'
'this string has spam in it': None
'this string has spamegg in it': None
'this string has egg in it': None
'this string has egg in it and another egg too': None
'this string has both egg and spam in it': None
"the word spams shouldn't match": None
"and eggs shouldn't match, either": None

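The ordered version requires stringA to appear before stringB, whereas the egrep pipeline in the question accepts the two patterns in either order. One possible way to get that behavior from a single compiled regex is to chain lookahead assertions, as sketched below. The helper name make_all_patterns_regex and the sample lines are assumptions for illustration, and the inputs are treated as regular expressions (as in the question), so they are not escaped. Whether this is actually faster than two re.search calls is not established here and would need to be measured.

```python
import re

def make_all_patterns_regex(patterns):
    # Each lookahead asserts that one pattern occurs somewhere in the line,
    # without consuming text, so the order of the patterns does not matter.
    return re.compile("".join("(?=.*(?:{}))".format(p) for p in patterns))

stringA = "Success"
stringB = "mysqlDB01"
both = make_all_patterns_regex([stringA, stringB])

# Made-up sample lines standing in for the contents of the data file
lines = [
    "Success: connected to mysqlDB01",
    "mysqlDB01 backup finished: Success",
    "Success: connected to mysqlDB02",
    "Failure on mysqlDB01",
]

# match() anchors at the start of the line, where every lookahead is tested
matching = [line for line in lines if both.match(line)]
print(matching)
```

Only the first two sample lines are kept, since they are the only ones that contain both patterns.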