I implemented a fast search algorithm in Python 2.7.13. It does what I want, but I have a small performance issue. These are the characteristics of my algorithm:
This is the implementation I have:
```python
def find_indexes(text, words):
    words_indexes = []
    found_words = []
    authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')']
    text_length = len(text)
    for j, word in enumerate(words):
        i = 0
        # This loop moves on to the next occurrence if the current one
        # isn't valid (contained in another word or in an HTML tag)
        while i != -1:
            i = text.find(word, i + 1)
            if i + 1 + len(word) < text_length:
                # We check the characters before and after the word because some
                # words can be contained in others, like "vision" in "revision",
                # as well as being contained in HTML tags
                before = text[i - 1]
                after = text[i + len(word)]
                if (before in authorized_characters and
                        after in authorized_characters and not
                        (before == u'.' and after == u'.')):
                    words_indexes.append(i)
                    found_words.append(word)
                    i = -1
    return words_indexes, found_words
```
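The neighbour-character check can be illustrated in isolation. This is a minimal sketch of that boundary test (the helper name `is_standalone` is mine, not from the code above, and it omits the double-period special case): it shows why "vision" inside "revision" is rejected while a standalone "vision" passes.

```python
# Minimal sketch of the boundary test performed inside the loop above.
# `is_standalone` is an illustrative name, not part of the original code.
AUTHORIZED = set(u' .:;?!¿¡…()')

def is_standalone(text, start, word):
    """True if the occurrence at `start` is delimited on both sides."""
    end = start + len(word)
    before = text[start - 1] if start > 0 else u' '
    after = text[end] if end < len(text) else u' '
    return before in AUTHORIZED and after in AUTHORIZED

text = u'The revision changed our vision of things.'
print(is_standalone(text, text.find(u'vision'), u'vision'))   # False: inside "revision"
print(is_standalone(text, text.rfind(u'vision'), u'vision'))  # True: standalone word
```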
For a large word list and a large text, it starts to take quite a long time (not humanly long, but it isn't the only processing I do, and since it's part of a Django view, any time improvement is welcome).
Using these 1,282 words and this 231,884-character-long text (taken and processed from a Wait But Why article), I get an execution time of about 0.3 seconds on my computer.
But I feel there is a better way to do it, because the find() method takes up most of the computation time, as you can see with line_profiler:
```
Total time: 0.28045 s
Function: find_indexes at line 332

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   332                                           @line_profiler
   333                                           def find_indexes(text, words):
   334         1            4      4.0      0.0      words_indexes = []
   335         1            2      2.0      0.0      found_words = []
   336         1            2      2.0      0.0      authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')']
   337
   338         1            2      2.0      0.0      text_length = len(text)
   339
   340      1283         4362      3.4      0.7      for j, word in enumerate(words):
   341      1282         1646      1.3      0.3          i = 0
   342
   343      3436        11402      3.3      1.8          while i != -1:
   344      2154       543861    252.5     86.2              i = text.find(word, i + 1)
   345
   346      2154        22153     10.3      3.5              if i + 1 + len(word) < text_length:
   347
   348                                                           # We check the before and after character of the word because some words can be contained in others
   349                                                           # Like "vision" is in "revision". As well as being contained in HTML tags
   350      2154        16388      7.6      2.6                  before = text[i - 1]
   351      2154        19939      9.3      3.2                  after = text[i + len(word)]
   352      2154         7720      3.6      1.2                  if (before in authorized_characters and
   353       531         1468      2.8      0.2                          after in authorized_characters and not
   354       135          278      2.1      0.0                          (before == u'.' and after == u'.')):
   355       135          783      5.8      0.1                      words_indexes.append(i)
   356       135          428      3.2      0.1                      found_words.append(word)
   357
   358       135          573      4.2      0.1                      i = -1
   359
   360         1            2      2.0      0.0      return words_indexes, found_words
```
Answer (score: 1):
Here is an example that uses an HTML parser (so it filters down to the text elements of the document, avoiding matches inside attributes/tags) and a compiled regular expression that can scan for all the words in one pass instead of looping N times over the text (your main bottleneck):
```python
import ast

# regex (not the builtin one) and bs4 need to be pip installed
import regex
from bs4 import BeautifulSoup

# Parse the document so we don't have to worry about HTML stuff
# and can find actual text content more easily
with open('text_to_find_the_words.txt') as fin:
    soup = BeautifulSoup(fin, 'html.parser')

# Get the words to look at and compile a regex to find them.
# Might already be a list in memory instead of a file.
with open('list_of_words.txt') as fin:
    words = ast.literal_eval(fin.read())
matching_words = regex.compile(r'\b(\L<words>)\b', words=words)

# For each matching text element, do the highlighting
for match in soup.find_all(text=matching_words):
    subbed = matching_words.sub(r'<span style="background: yellow;">\1</span>', match)
    match.replace_with(BeautifulSoup(subbed, 'html.parser'))

# Write the results somewhere (probably to a HttpResponse object in your case)
with open('results.html', 'w') as fout:
    fout.write(str(soup))
```
If needed, you'll have to adjust this to highlight only one word.
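One way to do that adjustment is to limit the substitution with the `count` argument of `sub()`. Here is a minimal sketch using only the stdlib `re` module (the `\L<words>` named-list syntax is specific to the third-party `regex` package, so the alternation is built by hand; the sample `words` and `text` are made up for illustration):

```python
import re

# Hypothetical sample data; in the question these come from files.
words = [u'vision', u'revision']
text = u'Our revision changed our vision.'

# Build the alternation manually, longest-first so that "revision"
# is tried before its substring "vision".
alternation = u'|'.join(re.escape(w) for w in sorted(words, key=len, reverse=True))
pattern = re.compile(r'\b(%s)\b' % alternation)

# count=1 limits the substitution to the first occurrence only
highlighted = pattern.sub(r'<span style="background: yellow;">\1</span>',
                          text, count=1)
print(highlighted)
# → Our <span style="background: yellow;">revision</span> changed our vision.
```

Without `count=1`, both occurrences would be wrapped in the highlight span.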