Question

I need to extract all the words in a text that satisfy some condition, for example that they appear in a dictionary.

some_dict = set()    # initialize from file

def test1(word):
    return word in some_dict

def extract1(text):
    return [word for word in text.split() if test1(word)]

Fine, but some entries in the dictionary contain multiple words, up to 4.

MAX_DEPTH = 4

def extract2(text):
    words = text.split()
    return [word for i, word in enumerate(words) if test2(words[i:i + MAX_DEPTH])]

def test2(words):
    # range must run to len(words) inclusive, or the full-length phrase is never tested
    for phrase in (' '.join(words[:i]) for i in range(1, len(words) + 1)):
        if phrase in some_dict:
            return True
    return False

Oh, but I need the whole phrase, not just its first word, so

def extract3(text):
    words = text.split()
    res = []
    for i in range(len(words)):
        matched = test3(words[i:i + MAX_DEPTH])
        if matched:
            res.append(matched)
    return res

def test3(words):
    # range must run to len(words) inclusive, or the full-length phrase is never tested
    for phrase in (' '.join(words[:i]) for i in range(1, len(words) + 1)):
        if phrase in some_dict:
            return phrase
    return None

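OK, but if a multi-word phrase matches, I need to skip it rather than test its remaining words, even if they appear in the dictionary as individual words. So I need a retractable iterator. Here is my attempt at implementing one: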

from copy import copy

def extract4(text):
    words = text.split()
    res = []
    it = iter(words)
    try:
        while True:
            matched, it = test4(it)
            if matched:
                res.append(matched)
    except StopIteration:
        pass

    return res

def test4(it):
    words = [next(it)]  # will raise StopIteration when the list is exhausted
    save = copy(it)
    try:
        for _ in range(MAX_DEPTH):
            phrase = ' '.join(words)
            if phrase in some_dict:
                return phrase, it   # skip the phrase
            words.append(next(it))
    except StopIteration:
        pass

    return None, save # retract

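I'm a bit worried that creating a copy of the iterator for every word in the text could hurt performance, since the text may be very long. More generally, can this be improved in terms of style and performance?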
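
For comparison, here is a minimal sketch of an index-based variant that never copies an iterator. It assumes the whole text is already split into a list (which `text.split()` does anyway). The name `extract_by_index` and the `vocabulary` parameter are hypothetical, introduced only for this illustration. Like `test4`, it accepts the shortest matching phrase at each position.

```python
MAX_DEPTH = 4  # longest dictionary entry, in words

def extract_by_index(text, vocabulary, max_depth=MAX_DEPTH):
    """Return matched phrases, skipping the words consumed by each match."""
    words = text.split()
    res = []
    i = 0
    while i < len(words):
        # Candidate lengths are capped by the words remaining after position i.
        for n in range(1, min(max_depth, len(words) - i) + 1):
            phrase = ' '.join(words[i:i + n])
            if phrase in vocabulary:
                res.append(phrase)
                i += n  # skip every word of the matched phrase
                break
        else:
            i += 1  # no phrase starts here; advance by one word
    return res
```

Retracting is then just a matter of not advancing `i`, so no per-word copy is needed.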

Edit:
This question suggests a solution based on a bidirectional iterator, but I would rather let the client use standard iterators.
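
One way this might work with a standard iterator is to keep a small lookahead buffer instead of retracting. The sketch below is only an illustration under that assumption. `extract_lookahead` and `vocabulary` are hypothetical names, and it uses only `collections.deque` and `itertools.islice` from the standard library.

```python
from collections import deque
from itertools import islice

def extract_lookahead(words, vocabulary, max_depth=4):
    """Yield matched phrases from any iterable of words, buffering at most max_depth words."""
    it = iter(words)
    buf = deque()
    while True:
        # Top the buffer up to max_depth words; islice simply stops at the end of input.
        buf.extend(islice(it, max_depth - len(buf)))
        if not buf:
            return
        window = list(buf)
        for n in range(1, len(window) + 1):
            phrase = ' '.join(window[:n])
            if phrase in vocabulary:
                yield phrase
                for _ in range(n):
                    buf.popleft()  # skip the whole matched phrase
                break
        else:
            buf.popleft()  # no phrase starts at this word

# Usage: the client passes a plain iterator or generator.
# list(extract_lookahead(iter(text.split()), some_dict))
```

The buffer never holds more than `max_depth` words, so the memory cost does not grow with the length of the text.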

Lookahead or retractable iterator