Python regex for either case

Time: 2013-12-07 21:35:25

Tags: python regex nltk lemmatization

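I have a small module that takes a word's lemma and its plural form. It then searches a list of sentences for those that contain both words (singular or plural), in either order. I have it working, but I was wondering whether there is a more elegant way to build this expression. Thanks! Note: Python 2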

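import re

# One tuple per lemma; each tuple lists the accepted forms of that lemma.
words = (('cell',), ('wolf', 'wolves'))
string1 = "(?:" + "|".join(words[0]) + ")"
string2 = "(?:" + "|".join(words[1]) + ")"
# Allow the two groups to appear in either order.
pat = ".+".join((string1, string2)) + "|" + ".+".join((string2, string1))
# Pat output: "(?:cell).+(?:wolf|wolves)|(?:wolf|wolves).+(?:cell)"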

Then to search:

pat = re.compile(pat)
for sentence in sentences:
    if len(pat.findall(sentence)) != 0:
        print sentence+'\n'

2 answers:

Answer 0: (score: 0)

Something like this:

# Raw strings are needed so that \b means "word boundary", not a backspace character.
[ x for x in sentences if re.search( r'\bcell\b', x ) and
        ( re.search( r'\bwolf\b', x ) or re.search( r'\bwolves\b', x ) )]
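The same idea can be generalized so that the patterns are built from the `words` tuple instead of being written out by hand. The following is only a sketch under some assumptions: the helper names `group_pattern` and `sentences_with_all` are made up for illustration, matching is case-sensitive, and each form is passed through `re.escape` in case it contains regex metacharacters. It should run on both Python 2 and Python 3.

```python
import re

def group_pattern(forms):
    """Compile a whole-word pattern matching any form in one lemma group."""
    alternation = "|".join(re.escape(form) for form in forms)
    return re.compile(r"\b(?:%s)\b" % alternation)

def sentences_with_all(sentences, words):
    """Return the sentences that contain at least one form from every group."""
    patterns = [group_pattern(forms) for forms in words]
    return [s for s in sentences if all(p.search(s) for p in patterns)]

words = (("cell",), ("wolf", "wolves"))
sentences = [
    "The wolves circled the cell.",
    "A cell divides.",
    "The cellular wolf howled.",
]
matches = sentences_with_all(sentences, words)
```

Because each group is searched independently, word order does not matter, and the `\b` anchors keep "cellular" from counting as "cell".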

Answer 1: (score: 0)

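The problem is that once you start adding multiple compound lookaround expressions, the algorithmic complexity gets out of hand. That is a fundamental limitation of solving this problem with a regex.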
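For reference, the lookaround style mentioned above might look like the following sketch, with one lookahead per lemma group. The function name `lookahead_pattern` is hypothetical, and the use of `re.DOTALL` and `\b` anchors is an assumption about what is wanted. Each lookahead can rescan the sentence from the start, so the work grows with the number of groups.

```python
import re

def lookahead_pattern(words):
    """Build one pattern that requires a form from every group, in any order."""
    parts = []
    for forms in words:
        alternation = "|".join(re.escape(form) for form in forms)
        # Each lookahead asserts the group occurs somewhere ahead, consuming nothing.
        parts.append(r"(?=.*\b(?:%s)\b)" % alternation)
    return re.compile("".join(parts), re.DOTALL)

words = (("cell",), ("wolf", "wolves"))
pat = lookahead_pattern(words)

# Use match() so the lookaheads are evaluated from the start of the sentence.
found = bool(pat.match("The cell held two wolves"))
missing = bool(pat.match("The cell was empty"))
```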

Another approach is to make a single O(n) pass over each sentence with a Counter, and then query it:

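from collections import Counter
from string import punctuation

# Helper function: total occurrences of any form of one lemma.
def count_lemma(counter, *args):
    return sum(counter[word] for word in args)

for sentence in sentences:
    # Count lowercased tokens with trailing punctuation removed.
    c = Counter(x.rstrip(punctuation).lower() for x in sentence.split())
    # Keep the sentence only if every lemma group occurs at least once.
    if all(count_lemma(c, *word) for word in words):
        print sentence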
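A self-contained variant of the Counter approach, wrapped in a function so it can be tested, might look like the sketch below. It makes two assumptions that are not in the answer above: the function name `contains_all_lemmas` is invented for illustration, and it uses `strip` rather than `rstrip` so that leading punctuation such as an opening quote is removed as well. Tokens are still split on whitespace only.

```python
from collections import Counter
from string import punctuation

def count_lemma(counter, *forms):
    """Total occurrences of any form of one lemma."""
    return sum(counter[form] for form in forms)

def contains_all_lemmas(sentence, words):
    """True if the sentence has at least one form from every lemma group."""
    # Single pass: normalize each whitespace-separated token and count it.
    counts = Counter(tok.strip(punctuation).lower() for tok in sentence.split())
    return all(count_lemma(counts, *forms) for forms in words)

words = (("cell",), ("wolf", "wolves"))
sentences = [
    '"Wolves!" shouted the guard from the cell.',
    "The cell was empty.",
]
matches = [s for s in sentences if contains_all_lemmas(s, words)]
```

A missing key in a `Counter` returns 0 rather than raising `KeyError`, which is what lets `count_lemma` look up forms that never appear in the sentence.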