Question

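I have to replace many occurrences of tokens in a large Unicode text document. Currently I iterate over the words in my dictionary and replace each one with sub on a compiled regex: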

for token,replacement in dictionary.tokens().iteritems():
    r = re.compile(word_regex_unicode(token), flags=re.I | re.X | re.UNICODE)
    text = r.sub(replacement,text)

where my regex is

# UTF8 unicode word regex
def word_regex_unicode(word):
    return r"(?<!\S){}(?!\S)".format(re.escape(word))

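With this approach, a new regex has to be compiled and sub called on text for every token, whether or not the token occurs in the document. As an alternative, one could use re.finditer to look for occurrences of the token and call re.sub only if the token is found: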

for token,replacement in dictionary.tokens().iteritems():
    r = re.compile(word_regex_unicode(token), flags=re.I | re.X | re.UNICODE)
    for m in r.finditer(text):
        # at least one match: a single sub call replaces every occurrence
        text = r.sub(replacement,text)
        break

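This avoids calling re.sub when it is not actually needed. The last approach could be refined further by using the match objects returned by re.finditer to splice in the replacement directly: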

# splice right to left so earlier match offsets stay valid
for m in reversed(list(r.finditer(text))):
    # replace the matched span from m.start() to m.end()
    text = text[:m.start()] + replacement + text[m.end():]

Which of these approaches is faster?
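
One way to settle this empirically is to time the three variants with timeit. The following is only a minimal sketch, assuming Python 3 (the snippets above are Python 2) and with regex compilation hoisted out of the timed code. The sample tokens, the sample text, and the helper names (compile_all, replace_with_sub, replace_with_search_then_sub, replace_with_slicing) are hypothetical stand-ins, and the replacement strings are assumed to contain no backslashes, since sub interprets those as escapes while slicing does not.

```python
import re
import timeit


def word_regex_unicode(word):
    # Same pattern as in the question: token bounded by whitespace or string edges
    return r"(?<!\S){}(?!\S)".format(re.escape(word))


def compile_all(tokens):
    # Compile once, outside the timed code, so only matching/replacing is measured
    flags = re.I | re.X | re.UNICODE
    return [(re.compile(word_regex_unicode(t), flags), rep)
            for t, rep in tokens.items()]


def replace_with_sub(text, compiled):
    # Approach 1: call sub unconditionally for every token
    for pattern, replacement in compiled:
        text = pattern.sub(replacement, text)
    return text


def replace_with_search_then_sub(text, compiled):
    # Approach 2: call sub only when the token occurs at least once
    for pattern, replacement in compiled:
        if pattern.search(text):
            text = pattern.sub(replacement, text)
    return text


def replace_with_slicing(text, compiled):
    # Approach 3: splice each match by hand, right to left so offsets stay valid
    for pattern, replacement in compiled:
        for m in reversed(list(pattern.finditer(text))):
            text = text[:m.start()] + replacement + text[m.end():]
    return text


# Hypothetical sample data; substitute the real dictionary and document
tokens = {"foo": "bar", "héllo": "wörld", "absent": "unused"}
text = "foo baz héllo FOO qux " * 1000
compiled = compile_all(tokens)

for func in (replace_with_sub, replace_with_search_then_sub, replace_with_slicing):
    seconds = timeit.timeit(lambda: func(text, compiled), number=5)
    print("{}: {:.3f}s".format(func.__name__, seconds))
```

The absolute numbers depend on the document size, on how many dictionary tokens actually occur in it, and on the Python version, so the timings are only meaningful when run against representative data.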

Python finditer or sub for large Unicode text

0 Answers: