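Fast way to check for words in markdown?

Asked: 2019-07-17 16:01:41

Tags: python regex nlp markdown

I want to scan a text for the presence of any word from a word list. That would be simple if the text were unformatted, but it is markdown-formatted. At the moment I am doing it with a regular expression: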

import re

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
found_words = []

for word in words:
    # re.escape keeps any regex metacharacters in a word from being interpreted
    word_pattern = re.compile(r'(^|[ \*_])' + re.escape(word) + r'($|[ \*_.!?])', (re.I | re.M))
    match = word_pattern.search(text)
    if match:
        found_words.append(word)

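I am working with very long word lists (a kind of deny list) and very large candidate texts, so speed matters to me. Is this a reasonably efficient and fast approach? Is there a better one?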
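For comparison with the per-word loop above, here is a minimal sketch of a variation that compiles a single alternation once and reuses it for every text. It is not from the original post: the helper names `compile_word_pattern` and `find_words` are hypothetical, the delimiter sets are copied from the pattern above, and a non-empty word list is assumed. Whether it is actually faster on a real deny list would need to be measured.

```python
import re

def compile_word_pattern(words):
    """Build one case-insensitive pattern that matches any listed word."""
    # Escape each word so regex metacharacters are matched literally.
    alternation = '|'.join(re.escape(w) for w in words)
    # The trailing delimiter is a lookahead so it is not consumed and can
    # serve as the leading delimiter of the next word.
    return re.compile(
        r'(?:^|[ *_])(' + alternation + r')(?=$|[ *_.!?])',
        re.I | re.M,
    )

def find_words(pattern, text):
    """Return the distinct matched words, lowercased."""
    return {m.lower() for m in pattern.findall(text)}

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
pattern = compile_word_pattern(words)   # compile once, reuse for every text
print(find_words(pattern, text))        # {'markdown'}
```

The text is scanned once per call instead of once per word, which is the main point of the variation.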

1 Answer:

Answer 0: (score: 1)

Have you considered simply stripping the asterisks from the start and end of each word?

import re

from timeit import default_timer as timer


text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']

def regexpCheck(words, text, n):
    found_words = []

    start = timer()
    for i in range(n):
        for word in words:
            word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
            match = word_pattern.search(text)
            if match:
                found_words.append(word)

    end = timer()
    return (end - start)


def stripCheck(words, text, n):
    found_words = []

    start = timer()
    for i in range(n):
        for word in text.split():
            candidate = word.strip('*')
            if candidate in words:
                found_words.append(candidate)
    end = timer()

    return (end - start)


n = 10000
print(stripCheck(words, text, n))
print(regexpCheck(words, text, n))

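In my runs the strip-based version is about an order of magnitude faster (times in seconds, stripCheck first, then regexpCheck):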

0.010649851000000002
0.12086547399999999
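Note that stripCheck as written is not quite equivalent to the regex: it only strips '*', it is case-sensitive, and `candidate in words` is a linear scan of the list. A minimal sketch that closes those gaps, assuming it is acceptable to strip all ASCII punctuation from both ends of each token (the helper name `find_listed_words` is hypothetical, and this version was not part of the timing above):

```python
import string

# Characters stripped from both ends of each token. string.punctuation
# already contains the markdown emphasis markers '*' and '_'.
STRIP_CHARS = string.punctuation

def find_listed_words(words, text):
    """Return the listed words that occur in text, ignoring case."""
    wanted = {w.lower() for w in words}   # set: O(1) average membership test
    found = set()
    for token in text.split():
        candidate = token.strip(STRIP_CHARS).lower()
        if candidate in wanted:
            found.add(candidate)
    return found

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
print(find_listed_words(words, text))  # {'markdown'}
```

With a long deny list the set lookup should matter more than the stripping, since each token costs one hash lookup instead of a scan over the whole list.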