Question

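I have a small module that takes a word's lemma and its plural form. It then searches a list of sentences for those that contain both words (in singular or plural form, in either order). I have it working, but I was wondering whether there is a more elegant way to build this expression. Thanks! Note: Python 2.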

words = (("cell",), ("wolf", "wolves"))
string1 = "(?:" + "|".join(words[0]) + ")"
string2 = "(?:" + "|".join(words[1]) + ")"
pat = ".+".join((string1, string2)) + "|" + ".+".join((string2, string1))
# pat is now: "(?:cell).+(?:wolf|wolves)|(?:wolf|wolves).+(?:cell)"

Then the search:

import re

pat = re.compile(pat)
for sentence in sentences:
    # search() stops at the first hit; there is no need to collect every match
    if pat.search(sentence):
        print sentence+'\n'

Answer 1

Something like this:

# Raw strings are required: in a plain string, '\b' is a backspace, not a word boundary
[ x for x in sentences if re.search( r'\bcell\b', x ) and
        ( re.search( r'\bwolf\b', x ) or re.search( r'\bwolves\b', x ) )]
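
The same idea can be generalized to any number of word groups. What follows is only a minimal sketch, assuming `words` has the shape shown in the question (a tuple of tuples of alternative forms). The helper name `matches_all_groups` and the sample sentences are made up for illustration, and the code is written to run on Python 2 and 3 alike:

```python
import re

def matches_all_groups(sentence, words):
    """Return True if the sentence contains at least one form from every group."""
    return all(
        any(re.search(r'\b' + re.escape(form) + r'\b', sentence) for form in group)
        for group in words
    )

# Same structure as in the question; the sentences are hypothetical sample data
words = (("cell",), ("wolf", "wolves"))
sentences = [
    "The wolves circled the cell.",
    "A wolf howled.",
    "The cell was empty.",
]

matching = [s for s in sentences if matches_all_groups(s, words)]
print(matching)
```

Word order never enters into it, because each group is checked on its own. Note that the match is case-sensitive as written; passing `re.IGNORECASE` to `re.search` would relax that.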

Answer 2

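The problem is that once you start adding multiple compound lookaround expressions, the algorithmic complexity gets out of hand. That is a fundamental drawback of solving this problem with regular expressions.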
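
To make the concern concrete, a lookaround-based version might chain one lookahead per word group. This is only a sketch of what such a pattern could look like, not code from the question; `build_lookahead_pattern` is a hypothetical helper and `words` is assumed to have the shape used in the question:

```python
import re

def build_lookahead_pattern(words):
    """Build one lookahead per word group, so the groups may appear in any order."""
    lookaheads = [
        r'(?=.*\b(?:' + '|'.join(re.escape(form) for form in group) + r')\b)'
        for group in words
    ]
    return re.compile(''.join(lookaheads))

words = (("cell",), ("wolf", "wolves"))
pat = build_lookahead_pattern(words)
print(pat.pattern)
```

Each lookahead scans forward through the sentence on its own, and when a sentence does not match, `search` retries from every starting position, so the work grows as more groups are added.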

Another approach is to make a single O(n) pass over each sentence with a Counter and then query it:

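from collections import Counter
from string import punctuation

# Helper: total number of occurrences of any form of a lemma
def count_lemma(counter, *args):
    return sum(counter[word] for word in args)

for sentence in sentences:
    # One pass per sentence: lowercase each token and strip trailing punctuation
    c = Counter(x.rstrip(punctuation).lower() for x in sentence.split())
    # Keep the sentence only if every word group occurs at least once
    if all(count_lemma(c, *word) for word in words):
        print sentence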
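
For trying this out in isolation, the loop can be wrapped in a function and run against a couple of sentences. This is a sketch under assumptions: `contains_all_lemmas` is a hypothetical name, the sample sentences are invented, and the `print` call is written so that it behaves the same on Python 2 and 3:

```python
from collections import Counter
from string import punctuation

def contains_all_lemmas(sentence, words):
    """True if every word group has at least one form among the sentence's tokens."""
    counts = Counter(tok.rstrip(punctuation).lower() for tok in sentence.split())
    # A Counter returns 0 for missing keys, so absent forms simply add nothing
    return all(sum(counts[form] for form in group) for group in words)

words = (("cell",), ("wolf", "wolves"))
sentences = [
    "Wolves, it turns out, avoid the cell!",  # both groups present
    "The cellular wolf is a myth.",           # "cellular" is not the token "cell"
]

for sentence in sentences:
    print("%s -> %s" % (sentence, contains_all_lemmas(sentence, words)))
```

Because tokens are compared whole, partial matches such as "cellular" are rejected without any word-boundary logic. Only trailing punctuation is stripped here, so a token with a leading quote or bracket would not be recognized.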
