I need to extract all the words from a text that satisfy a certain condition, for example, words that appear in some dictionary.
some_dict = set()  # initialize from file

def test1(word):
    return word in some_dict

def extract1(text):
    return [word for word in text.split() if test1(word)]
That works, but some entries in the dictionary consist of multiple words, up to 4.
MAX_DEPTH = 4

def extract2(text):
    words = text.split()
    return [word for i, word in enumerate(words) if test2(words[i:i + MAX_DEPTH])]

def test2(words):
    # Try every prefix of the window, from 1 word up to the whole window.
    for phrase in (' '.join(words[:i]) for i in range(1, len(words) + 1)):
        if phrase in some_dict:
            return True
    return False
Oh, but I need the whole phrase, not just its first word, so:
def extract3(text):
    words = text.split()
    res = []
    for i in range(len(words)):
        matched = test3(words[i:i + MAX_DEPTH])
        if matched:
            res.append(matched)
    return res

def test3(words):
    # Return the shortest prefix of the window that is in the dictionary.
    for phrase in (' '.join(words[:i]) for i in range(1, len(words) + 1)):
        if phrase in some_dict:
            return phrase
    return None
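To make the remaining problem concrete, here is a small self-contained sketch of the same sliding-window logic. The dictionary entries and the names `first_match` and `extract_overlapping` are made up for illustration; they are not part of my real code.

```python
MAX_DEPTH = 4
some_dict = {"new york", "york", "city"}  # hypothetical sample entries

def first_match(words):
    # Try prefixes of 1..len(words) words, shortest first.
    for i in range(1, len(words) + 1):
        phrase = ' '.join(words[:i])
        if phrase in some_dict:
            return phrase
    return None

def extract_overlapping(text):
    words = text.split()
    # A window is tested at every position, including positions
    # that fall inside a phrase that has already matched.
    matches = (first_match(words[i:i + MAX_DEPTH]) for i in range(len(words)))
    return [m for m in matches if m]

print(extract_overlapping("i love new york city"))
# ['new york', 'york', 'city']
```

Note that "york" is reported a second time on its own, even though it was already consumed as part of "new york".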
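OK, but if a multi-word phrase matches, I need to skip all of it rather than test its remaining words, even if those words also appear in the dictionary as separate entries. So I need a retractable iterator. Here is my attempt at implementing one: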
from copy import copy

def extract4(text):
    words = text.split()
    res = []
    it = iter(words)
    try:
        while True:
            matched, it = test4(it)
            if matched:
                res.append(matched)
    except StopIteration:
        pass
    return res

def test4(it):
    words = [next(it)]  # will raise StopIteration when the list is exhausted
    save = copy(it)
    try:
        for _ in range(MAX_DEPTH):
            phrase = ' '.join(words)
            if phrase in some_dict:
                return phrase, it  # skip the phrase
            words.append(next(it))
    except StopIteration:
        pass
    return None, save  # retract
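I am a bit worried that creating a copy of the iterator for every word in the text may hurt performance, since the text can be very long. More generally, can this be improved in terms of style and performance?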
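For comparison, one direction I have considered is dropping the iterator altogether and walking the word list by index, so that "retracting" is just a matter of not advancing. This is only a rough sketch, assuming the whole word list fits in memory; `extract_by_index` and the sample dictionary are hypothetical names and data.

```python
MAX_DEPTH = 4
some_dict = {"new york", "york", "city"}  # hypothetical sample entries

def extract_by_index(text):
    words = text.split()
    res = []
    i = 0
    while i < len(words):
        # Never look past the end of the list.
        longest = min(MAX_DEPTH, len(words) - i)
        for n in range(1, longest + 1):
            phrase = ' '.join(words[i:i + n])
            if phrase in some_dict:
                res.append(phrase)
                i += n  # skip every word of the matched phrase
                break
        else:
            i += 1  # no phrase starts here, move on by one word
    return res

print(extract_by_index("i love new york city"))
# ['new york', 'city']
```

Like `test4`, this takes the shortest matching phrase at each position. No iterator copies are made, but it only works on a sequence, not on an arbitrary iterator.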
Edit:
This question proposes a solution based on a bidirectional iterator, but I would rather let the client use standard iterators.
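To illustrate what I mean by keeping standard iterators on the client side: something like the following lookahead buffer would accept any iterable of words and never needs to copy or rewind the iterator. It is a sketch under my own assumptions, not a tested solution; `extract_buffered` and the sample dictionary are hypothetical.

```python
from collections import deque
from itertools import islice

MAX_DEPTH = 4
some_dict = {"new york", "york", "city"}  # hypothetical sample entries

def extract_buffered(words):
    it = iter(words)  # any standard iterable of words
    buf = deque()     # holds at most MAX_DEPTH words of lookahead
    res = []
    while True:
        # Top the buffer up to MAX_DEPTH words, or fewer near the end.
        buf.extend(islice(it, MAX_DEPTH - len(buf)))
        if not buf:
            return res
        window = list(buf)
        for n in range(1, len(window) + 1):
            phrase = ' '.join(window[:n])
            if phrase in some_dict:
                res.append(phrase)
                for _ in range(n):
                    buf.popleft()  # drop the whole matched phrase
                break
        else:
            buf.popleft()  # no match starts here, drop one word

print(extract_buffered(iter("i love new york city".split())))
# ['new york', 'city']
```

The words that would otherwise have to be "un-read" simply stay in the buffer, so the underlying iterator only ever moves forward.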