Question

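I need to make this function run faster (roughly 20 times faster) to meet a required benchmark. I've made many improvements over my initial implementation, but I've hit a wall.

The basic problem is: count the case-insensitive occurrences of word in text.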

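The complicating conditions are:

It must be a complete word (the word "George" is not found in the text "Georges")
Single quotes are to be considered part of the word, unless there are several in a row
The word may actually be a phrase (meaning it could contain spaces, punctuation, etc.)
Regular expressions can't be used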

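My basic implementation loops through each character in text while keeping track of my position in word. If the character matches the corresponding character of word, I add it to a local string, advance my positions in word and text, and go again. Once I have a match candidate (i.e., my local string equals word), I check the surrounding characters to make sure the candidate is a complete word, per rules 1 and 2 above. Note that this check does not happen often enough to materially affect the total time the algorithm takes.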

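The most successful optimizations I've made so far:

Lowercasing the strings and measuring their lengths outside the loop
Checking that word is at least a substring of text, and returning 0 immediately if it is not
Not bothering to check for a complete word until we have a full match
Counting the occurrences up front (without the rules) and breaking out of the loop as soon as that count is reached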

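I've profiled the code line by line with pprofile, and most of the runtime goes to simple lines such as incrementing the counter variables, resetting the match_candidate string to "", indexing into the strings, and the if statements. I haven't included the code for validate_full_match because it is not a significant consumer of time.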

Is there any low-hanging fruit I'm missing? Should I be considering a different approach altogether?

Thanks for any suggestions!

def count_occurences_in_text(word, text):
    """Number of occurrences of word (case insensitive) in text

    Note that word can actually be any length of text, from a single
    character to a complete phrase; however, partial words do not
    count. For example:
    count_occurences_in_text("george", "I am Georges") returns 0
    while
    count_occurences_in_text("i am", "I am Georges") returns 1
    """
    # We perform some measurements and manipulation at the start to
    # avoid performing them repeatedly in the loop below
    text = text.lower()
    word = word.lower()
    max_matches = text.count(word)
    if max_matches == 0:
        return 0
    word_len = len(word)
    # Counter vars
    match_count = 0
    text_cursor = 0
    word_cursor = 0
    # We will build up match_candidate and check it against word
    match_candidate = ""
    for text_char in text:
        if text_char == word[word_cursor]:
            match_candidate += text_char
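            if word == match_candidate:
                if validate_full_match(text, text_cursor, word_len):
                    match_count += 1
                    if match_count == max_matches:
                        break
                # Reset after every full-length candidate, whether or not it
                # validated; otherwise word_cursor stays at the last index
                # and a later increment can run past the end of word.
                word_cursor = 0
                match_candidate = ""
            else:
                word_cursor += 1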
        else:
            match_candidate = ""
            word_cursor = 0
        text_cursor += 1
    return match_count

Answer 1

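Python strings are immutable, so every time you execute match_candidate += text_char you are actually creating a new string and copying the entire previous contents of match_candidate into it. Say your word is 'helloworld'. Whenever the text contains a possible match such as 'helloworl', you perform on the order of len(word)^2 character copies here. You can avoid that by maintaining an index instead, which saves a lot of operations.
max_matches = text.count(word): you can avoid this by checking whether you have reached the end of the text instead. The call costs O(len(text)) up front, which is avoidable.
validate_full_match: what does this function check? If it only inspects the surrounding characters, you can avoid the call by handling those cases while comparing individual characters.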
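
As a rough illustration of the three points above, here is a sketch that tracks positions with integer indices instead of building a string, skips the text.count(word) pre-pass, and checks word boundaries inline. The names count_whole_matches and is_boundary are hypothetical, and the apostrophe handling is only one reading of the rules in the question (a lone ' is part of a word; two or more in a row act as a separator). It has not been benchmarked against the original.

```python
def is_boundary(text, i):
    """True if position i is outside text or holds a non-word character.

    Assumption: letters and digits are word characters, and so is a lone
    apostrophe; a run of two or more apostrophes is treated as a separator.
    """
    if i < 0 or i >= len(text):
        return True
    ch = text[i]
    if ch.isalnum():
        return False
    if ch == "'":
        prev_quote = i > 0 and text[i - 1] == "'"
        next_quote = i + 1 < len(text) and text[i + 1] == "'"
        return prev_quote or next_quote
    return True


def count_whole_matches(word, text):
    """Count whole-word, case-insensitive matches using indices only."""
    text = text.lower()
    word = word.lower()
    word_len = len(word)
    if word_len == 0:
        return 0
    match_count = 0
    start = 0
    last_start = len(text) - word_len  # Last index a match could begin at.
    while start <= last_start:
        # Only candidates that begin on a boundary can be whole words.
        if is_boundary(text, start - 1):
            offset = 0
            while offset < word_len and text[start + offset] == word[offset]:
                offset += 1
            if offset == word_len and is_boundary(text, start + word_len):
                match_count += 1
                start += word_len  # Continue after the matched span.
                continue
        start += 1
    return match_count
```

With the examples from the question's docstring, count_whole_matches("george", "I am Georges") gives 0 and count_whole_matches("i am", "I am Georges") gives 1.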

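Python is easy to write and has amazing built-in functions and constructs. To optimize, though, you need to keep track of the complexity of every line.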
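
One way to lean on those built-ins while still honoring the rules is to let str.find locate each candidate and only then inspect the neighboring characters. This is a different technique from the character-by-character loop in the question, offered as a sketch rather than a measured solution. The names count_with_find and _is_word_char are hypothetical, and the apostrophe rule is again an assumed interpretation (a lone ' counts as part of a word, a run of them does not).

```python
def _is_word_char(text, i):
    """True if text[i] exists and should count as part of a word."""
    if i < 0 or i >= len(text):
        return False  # Off either end of the text: nothing to collide with.
    ch = text[i]
    if ch.isalnum():
        return True
    if ch == "'":
        # Assumed rule: a lone apostrophe belongs to the word,
        # but two or more in a row do not.
        prev_quote = i > 0 and text[i - 1] == "'"
        next_quote = i + 1 < len(text) and text[i + 1] == "'"
        return not (prev_quote or next_quote)
    return False


def count_with_find(word, text):
    """Count whole-word, case-insensitive matches, using str.find to jump
    directly from one candidate to the next."""
    text = text.lower()
    word = word.lower()
    word_len = len(word)
    if word_len == 0:
        return 0
    match_count = 0
    pos = text.find(word)
    while pos != -1:
        end = pos + word_len
        if not _is_word_char(text, pos - 1) and not _is_word_char(text, end):
            match_count += 1
            pos = text.find(word, end)  # Resume after the accepted match.
        else:
            pos = text.find(word, pos + 1)  # Rejected: try the next start.
    return match_count
```

The per-character work happens inside str.find, so the Python-level loop runs once per candidate rather than once per character of text. Whether that is enough to reach a 20x speedup depends on the input and should be measured.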
