Fastest way to find the index of the first match of many strings in a large text

Date: 2017-08-09 08:35:09

Tags: python django python-2.7 text find

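I have implemented a fast-find algorithm in Python 2.7.13. It does what I want, but I have a small performance problem. These are the specifics of my problem:

  • My text is an HTML article, usually between 5,000 and 50,000 characters long, but it can be as large as 300,000 characters.
  • I have a list of "words" that can contain special characters (é, à, ø, / ...) and spaces, usually a few hundred to a few thousand entries. Word length ranges from 2 to 256 characters.
  • I need to ignore matches found inside HTML tags.
  • I need the index of each match in the text.
  • I only need the first match for each word.

What I have is this implementation: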

def find_indexes(text, words):
    words_indexes = []
    found_words = []
    authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')']

    text_length = len(text)

    for j, word in enumerate(words):
        i = 0 

        # This loop serves to go to the next word find if the first one isn't valid (contained in another word or in HTML tag)
        while i != -1: 
            i = text.find(word, i + 1)

            if i + 1 + len(word) < text_length:

                # We check the before and after character of the word because some words can be contained in others
                # Like "vision" is in "revision". As well as being contained in HTML tags
                before = text[i - 1]
                after = text[i + len(word)]
                if (before in authorized_characters and
                    after in authorized_characters and not
                    (before == u'.' and after == u'.')):
                    words_indexes.append(i)
                    found_words.append(word)

                    i = -1

    return words_indexes, found_words

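With a big list of words and a big text, it starts to take a fair amount of time (not huge in human terms, but it is not the only processing I do, and since it is part of a Django view, any time saved is welcome).

With these 1282 words and this 231884-character text (taken and processed from a Waitbutwhy article), I get an execution time of about 0.3 seconds on my computer.

But I feel there must be a better way, because the find() call takes most of the computation time, as you can see in this line_profiler output: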

Total time: 0.28045 s
Function: find_indexes at line 332

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   332                                           @line_profiler
   333                                           def find_indexes(text, words):
   334         1            4      4.0      0.0      words_indexes = []
   335         1            2      2.0      0.0      found_words = []
   336         1            2      2.0      0.0      authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')']
   337                                           
   338         1            2      2.0      0.0      text_length = len(text)
   339                                           
   340      1283         4362      3.4      0.7      for j, word in enumerate(words):
   341      1282         1646      1.3      0.3          i = 0
   342                                           
   343      3436        11402      3.3      1.8          while i != -1:
   344      2154       543861    252.5     86.2              i = text.find(word, i + 1)
   345                                           
   346      2154        22153     10.3      3.5              if i + 1 + len(word) < text_length:
   347                                           
   348                                                           # We check the before and after character of the word because some words can be contained in others
   349                                                           # Like "vision" is in "revision". As well as being contained in HTML tags
   350      2154        16388      7.6      2.6                  before = text[i - 1]
   351      2154        19939      9.3      3.2                  after = text[i + len(word)]
   352      2154         7720      3.6      1.2                  if (before in authorized_characters and
   353       531         1468      2.8      0.2                      after in authorized_characters and not
   354       135          278      2.1      0.0                      (before == u'.' and after == u'.')):
   355       135          783      5.8      0.1                      words_indexes.append(i)
   356       135          428      3.2      0.1                      found_words.append(word)
   357                                           
   358       135          573      4.2      0.1                      i = -1
   359                                           
   360         1            2      2.0      0.0      return words_indexes, found_words

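1 answer:

Answer 0 (score: 1)

Here is an example that uses an HTML parser (so it works only on the text elements of the document and avoids finding text inside attributes/tags) and a compiled regular expression (which can scan for all the words in a single pass instead of looping over the text N times, which is your main bottleneck):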

import ast
# regex (not the builtin one) and bs4 need to be pip installed 
import regex
from bs4 import BeautifulSoup

# Parse the document so we don't have to worry about HTML stuff
# and can find actual text content more easily
with open('text_to_find_the_words.txt') as fin:
    soup = BeautifulSoup(fin, 'html.parser')

# Get the words to look at and compile a regex to find them
# Might already be a list in memory instead of a file.
with open('list_of_words.txt') as fin:
    words = ast.literal_eval(fin.read())
    matching_words = regex.compile(r'\b(\L<words>)\b', words=words)

# For each matching text element, do the highlighting
for match in soup.find_all(text=matching_words):
    subbed = matching_words.sub(r'<span style="background: yellow;">\1</span>', match)
    match.replace_with(BeautifulSoup(subbed, 'html.parser'))

# Write the results somewhere (probably to a HttpResponse object in your case)
with open('results.html', 'w') as fout:
    fout.write(str(soup))

If needed, you will have to adjust this so that each word is highlighted only once.
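
As a rough sketch of the same single-pass idea applied to the original goal (the index of the first match of each word), the following uses only the standard `re` module instead of `regex` and BeautifulSoup. It is an illustration under assumptions, not a drop-in replacement: the function name `find_first_indexes` is made up here, HTML tags are skipped with a simple `<[^>]*>` alternative rather than a real parser, and "whole word" is taken to mean "not surrounded by `\w` characters", which is looser than the list of authorized characters in the question. It is written against Python 3.

```python
import re


def find_first_indexes(text, words):
    """Return {word: index of its first whole-word match outside HTML tags}.

    Hypothetical helper for illustration; indexes refer to the original text.
    """
    # Longest words first, so a longer entry wins over a shorter one
    # that starts at the same position.
    alternatives = sorted(set(words), key=len, reverse=True)
    if not alternatives:
        return {}

    pattern = re.compile(
        # First alternative: an HTML tag, consumed so nothing inside it matches.
        r'<[^>]*>'
        # Second alternative: any listed word not glued to other word characters.
        r'|(?<!\w)(' + '|'.join(re.escape(w) for w in alternatives) + r')(?!\w)',
        re.UNICODE,
    )

    first_indexes = {}
    for match in pattern.finditer(text):
        word = match.group(1)  # None when the match was an HTML tag
        if word is not None and word not in first_indexes:
            first_indexes[word] = match.start(1)
    return first_indexes


if __name__ == '__main__':
    sample = '<p class="vision">A revision of the vision. The vision, again.</p>'
    print(find_first_indexes(sample, ['vision', 'class', 'missing']))
```

Because the tags are matched and discarded in the same scan, the reported indexes stay relative to the untouched HTML. One known limitation of this sketch: matches do not overlap, so a word that only ever appears inside a longer listed phrase (for example "York" inside "New York") is not reported.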