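I want to scan a text for the presence of any word from a word list. This would be simple if the text were unformatted, but it is markdown-formatted. At the moment I am doing this with a regular expression: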
import re

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
found_words = []
for word in words:
    # The word must be delimited by start/end of line, a space,
    # a markdown marker (* or _), or trailing punctuation.
    word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
    match = word_pattern.search(text)
    if match:
        found_words.append(word)
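I am working with a very long word list (a kind of deny list) and very large candidate texts, so speed matters to me. Is this a reasonably efficient and fast way to do it? Is there a better way?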
Answer 0 (score: 1)
Have you considered stripping the asterisks before and after each word?
import re
from timeit import default_timer as timer

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']

def regexpCheck(words, text, n):
    # Original approach: compile and search one regex per word.
    found_words = []
    start = timer()
    for i in range(n):
        for word in words:
            word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
            match = word_pattern.search(text)
            if match:
                found_words.append(word)
    end = timer()
    return (end - start)

def stripCheck(words, text, n):
    # Alternative: split on whitespace and strip asterisks from each token.
    found_words = []
    start = timer()
    for i in range(n):
        for word in text.split():
            candidate = word.strip('*')
            if candidate in words:
                found_words.append(candidate)
    end = timer()
    return (end - start)

n = 10000
print(stripCheck(words, text, n))
print(regexpCheck(words, text, n))
In my runs, the strip-based version is about an order of magnitude faster:
0.010649851000000002
0.12086547399999999
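Note that stripCheck as written only strips asterisks and compares case-sensitively against a list, so it would miss tokens such as `_markup_` or `Markdown!` that the regex version accepts. A minimal sketch of a closer equivalent follows, assuming whitespace-separated tokens; the function name `find_listed_words` is hypothetical, and stripping everything in `string.punctuation` is an assumption about which markup and punctuation you expect. Storing the word list in a set should also keep lookups cheap for a very long deny list, though I have not benchmarked this variant.

```python
import string

# Characters stripped from both ends of each token. string.punctuation
# already contains '*' and '_' (assumption: no listed word starts or
# ends with punctuation).
STRIP_CHARS = string.punctuation

def find_listed_words(words, text):
    """Return the set of listed words found in text, ignoring case and edge markup."""
    wanted = {w.lower() for w in words}  # set membership is O(1) on average
    found = set()
    for token in text.split():
        candidate = token.strip(STRIP_CHARS).lower()
        if candidate in wanted:
            found.add(candidate)
    return found

text = 'A long text string with **Markdown** formatting and _markup_!'
words = ['markdown', 'markup', 'marksideways']
print(find_listed_words(words, text))
```

One difference from the regex: a word glued to other text without a space (for example `foo**bar**`) stays a single token here and is not matched, so check whether that case matters for your data.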