Question

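I'm working on a project that needs to check strings against a very large list of strings, looking for cases where a string is a substring of some other element of the list.

Originally I had this approach: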

def isSubstring(subWord, words):
    for superWord in words:
        if superWord.find(subWord) != -1 and len(subWord) != len(superWord):
            return True

    return False

def checkForSubstrings(words):
    words.sort(key=len, reverse=False)

    while len(words) > 1:
        currentWord = words.pop(0)

        if isSubstring(currentWord, words):
            print("%s is a substring of some other string" % currentWord)

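The idea is to sort all the strings by length and, for each word, compare it only against longer words.

But this approach has a flaw: after the sort, a word is still compared against words of the same length that happen to be placed after it.

So I changed the checkForSubstring method: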

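def checkForSubstring(words):
    # Sort by length so the slicing below drops exactly the current length group
    words = sorted(words, key=len)
    lengths = sorted(set(len(w) for w in words))
    sameLengthWordsLists = [[w for w in words if len(w) == num] for num in lengths]

    for wordList in sameLengthWordsLists:
        # Keep only the words strictly longer than the current group
        words = words[len(wordList):]

        if len(words) == 0:
            break

        for currentWord in wordList:
            # isSubstring is the helper defined above
            if isSubstring(currentWord, words):
                print("%s is a substring of some other string" % currentWord)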

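Instead of relying on a plain sort by length, this version splits the list of strings into several lists by length, then checks each list only against the list of longer words. This fixes the earlier problem.

But it still isn't fast. Can anyone suggest a faster approach? At the moment this is a bottleneck.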

Answer 1

Following up on my comment, something like this:

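from collections import defaultdict

def checkForSubstrings(words):
  # Note: this sorts and empties the list passed in
  # e.g: fo: [foo, foobar]
  super_strings = defaultdict(list)
  # e.g: foo: [fo, oo]
  substrings = defaultdict(list)
  # Longest first, so that pop() returns the shortest remaining word
  words.sort(key=len, reverse=True)
  while words:
    # Note: pop(0) is highly inefficient, as it shifts the whole list
    word = words.pop()
    subwords = substrings[word]
    # Find the smallest list of words that contain a known substring of `word`;
    # any superstring of `word` must be in that list
    current_words = min((super_strings[w] for w in subwords), key=len, default=[])
    if not current_words:
      # No known substrings yet: fall back to all remaining (longer) words
      current_words = words
    super_words = [w for w in current_words if len(w) > len(word) and word in w]
    for s in super_words:
      substrings[s].append(word)
    super_strings[word] = super_words
  # The result is in super_strings: word -> list of longer words containing it
  return super_strings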

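If no two words are substrings of one another, or if all of them are, this changes nothing. But if only some of them are, it should speed things up a bit. That, and using pop() instead of pop(0).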
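
To illustrate the pop() versus pop(0) point on its own: if you would rather keep processing the shortest word first from the front of the sequence, a collections.deque gives O(1) removal from the left. This is only a minimal sketch; find_substrings is a hypothetical name, and it does the same brute-force comparison as the original question, without the pruning above.

```python
from collections import deque

def find_substrings(words):
    """Return the words that occur inside some strictly longer word."""
    # Shortest first; sorted() leaves the caller's list untouched
    queue = deque(sorted(words, key=len))
    found = []
    while queue:
        # popleft() is O(1), whereas list.pop(0) shifts every remaining element
        word = queue.popleft()
        # Only words still in the queue can be longer than `word`
        if any(len(other) > len(word) and word in other for other in queue):
            found.append(word)
    return found

print(find_substrings(["foobar", "foo", "bar", "baz"]))
```

The length check skips same-length words, so duplicates are not reported as substrings of each other.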

Answer 2

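If your LARGE list of strings is not all that large, you could build a HUGE dict containing every possible contiguous substring. With that index in place, each subsequent lookup drops to O(1) on average, which may speed things up considerably.

Here is my sample code: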

# -*- coding: utf-8 -*-
import sys
from collections import defaultdict

text = """Sort all the strings by length, for each word, compare it only to the longer words.

But this method has a flaw in that words are still being compared to words of the same length which are arbitrarily placed after it during the list sort.

So I changed the "checkForSubstring" method:"""


def checkForSubstrings(words):
    # Build a big dict first; this may be slow and consume a lot of memory
    d = defaultdict(set)
    for windex, word in enumerate(words):
        # Enumerate every non-empty contiguous substring word[i:j]
        for i in range(len(word)):
            for j in range(i + 1, len(word) + 1):
                sub = word[i:j]
                # Store (word_index, matches_whole_word) in the dict
                d[sub].add((windex, sub == word))

    # sys.getsizeof(d) reports only the dict's own size, not the keys and sets it holds
    # print(sys.getsizeof(d))

    # Iterate over the words and find matches, ignoring whole-word matches
    for windex, word in enumerate(words):
        matches = d.get(word, ())
        for obj in matches:
            if not obj[1]:
                print("%s is a substring of some other string" % word)
                break

if __name__ == '__main__':
    words = text.lower().split()
    checkForSubstrings(words)

Output of this script:

sort is a substring of some other string
for is a substring of some other string
compare is a substring of some other string
it is a substring of some other string
method is a substring of some other string
a is a substring of some other string
in is a substring of some other string
words is a substring of some other string
are is a substring of some other string
words is a substring of some other string
length is a substring of some other string
are is a substring of some other string
it is a substring of some other string
so is a substring of some other string
i is a substring of some other string
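
Before committing to this approach, it may help to gauge how big the index gets: a word of length n has at most n(n+1)/2 distinct non-empty substrings, so long words get expensive quickly. The following is a rough sketch for estimating that on your own data; count_index_entries is a hypothetical helper, not part of the script above.

```python
def count_index_entries(words):
    """Return (number of dict keys, number of (substring, word) entries)."""
    distinct = set()   # would become the keys of the index
    total_pairs = 0    # would become the entries stored across all the sets
    for word in words:
        # All distinct non-empty contiguous substrings of this word
        subs = {word[i:j]
                for i in range(len(word))
                for j in range(i + 1, len(word) + 1)}
        distinct |= subs
        total_pairs += len(subs)
    return len(distinct), total_pairs

keys, entries = count_index_entries(["abc", "bc"])
print(keys, entries)
```

Running it on a sample of the real list should give a feel for whether the dict fits in memory.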
