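I have a small module that takes a word's lemma and its plural form. It then searches a list of sentences for those that contain both words (each in either singular or plural form). I have it working, but I'd like to know if there is a more elegant way to build this expression. Thanks! Note: Python 2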
import re

words = (('cell',), ('wolf', 'wolves'))
string1 = "(?:" + "|".join(words[0]) + ")"
string2 = "(?:" + "|".join(words[1]) + ")"
pat = ".+".join((string1, string2)) + "|" + ".+".join((string2, string1))
# pat is now: "(?:cell).+(?:wolf|wolves)|(?:wolf|wolves).+(?:cell)"
Then the search:
pat = re.compile(pat)
for sentence in sentences:
    if len(pat.findall(sentence)) != 0:
        print sentence+'\n'
Answer 0 (score: 0)
Something like this:
# raw strings are needed so that \b means "word boundary", not a backspace character
[x for x in sentences
 if re.search(r'\bcell\b', x) and
    (re.search(r'\bwolf\b', x) or re.search(r'\bwolves\b', x))]
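The same idea can be generalized so it works directly on the `words` tuple from the question instead of hard-coding each word. This is only a sketch: the function name `contains_all_lemmas` and the sample sentences are made up for illustration, and `re.escape` is added on the assumption that a word might contain regex metacharacters.

```python
import re

def contains_all_lemmas(sentence, words):
    """Return True if the sentence contains at least one form from every group."""
    return all(
        # any form of this lemma, matched as a whole word
        any(re.search(r'\b' + re.escape(form) + r'\b', sentence) for form in forms)
        for forms in words
    )

words = (('cell',), ('wolf', 'wolves'))
# hypothetical sample data
sentences = [
    'The wolves circled the cell.',
    'A wolf howled.',
    'The cellar was full of wolves.',
]
matches = [s for s in sentences if contains_all_lemmas(s, words)]
```

Because of the `\b` word boundaries, "cellar" does not count as "cell", which the original `.+`-joined pattern would have accepted.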
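Answer 1 (score: 0)
The problem is that once you start adding multiple compound lookaround expressions, the algorithmic complexity gets out of hand. That is a fundamental issue with using regular expressions to solve this problem.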
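For reference, one plausible reading of "compound lookaround expressions" is a pattern with one lookahead per lemma group, which removes the need to list every ordering of the words. The sketch below is an assumption about what is meant, not something taken from the question; `build_lookahead_pattern` is a made-up name.

```python
import re

def build_lookahead_pattern(words):
    """Build a pattern with one (?=...) lookahead per lemma group."""
    lookaheads = ''.join(
        r'(?=.*\b(?:' + '|'.join(re.escape(form) for form in forms) + r')\b)'
        for forms in words
    )
    return re.compile(lookaheads)

words = (('cell',), ('wolf', 'wolves'))
lookahead_pat = build_lookahead_pattern(words)

# re.match anchors at the start, so each lookahead scans the sentence once
hit = lookahead_pat.match('The wolves circled the cell.')
miss = lookahead_pat.match('A wolf howled.')
```

Each lookahead rescans the sentence from the start, so the work grows with the number of lemma groups. Note also that `.` does not match newlines unless `re.DOTALL` is passed, so this assumes single-line sentences.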
Another approach is to use a Counter: make a single O(n) pass over each sentence, then query the counts:
from collections import Counter
from string import punctuation

# helper function: total count of all forms of one lemma
def count_lemma(counter, *args):
    return sum(counter[word] for word in args)

for sentence in sentences:
    # one pass per sentence: lowercase each token and strip trailing punctuation
    c = Counter(x.rstrip(punctuation).lower() for x in sentence.split())
    if all(count_lemma(c, *word) for word in words):
        print sentence
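A self-contained variant of the Counter approach that collects the matching sentences instead of printing them might look like this. It is a sketch with two deviations from the code above, both my own assumptions: it uses `strip` rather than `rstrip` so that leading punctuation such as an opening quote is removed too, and the name `sentences_with_all_lemmas` and the sample data are hypothetical.

```python
from collections import Counter
from string import punctuation

def count_lemma(counter, *forms):
    """Total number of occurrences of any form of one lemma."""
    return sum(counter[form] for form in forms)

def sentences_with_all_lemmas(sentences, words):
    """Return the sentences that contain at least one form of every lemma."""
    matches = []
    for sentence in sentences:
        # single pass: normalize each whitespace-separated token and count it
        counts = Counter(
            token.strip(punctuation).lower() for token in sentence.split()
        )
        if all(count_lemma(counts, *forms) for forms in words):
            matches.append(sentence)
    return matches

words = (('cell',), ('wolf', 'wolves'))
# hypothetical sample data
sentences = [
    'The wolves circled the cell.',
    'A wolf howled.',
    '"Cell" and WOLF, together.',
]
found = sentences_with_all_lemmas(sentences, words)
```

Missing keys in a `Counter` return 0 rather than raising `KeyError`, which is what lets `count_lemma` sum over forms that never appear in a sentence.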