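I am working on a project that needs to check strings against a very large list of strings, looking for cases where a string is a substring of some other element of the list.
Originally I had this approach: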
def isSubstring(subWord, words):
    for superWord in words:
        if superWord.find(subWord) != -1 and len(subWord) != len(superWord):
            return True
    return False

def checkForSubstrings(words):
    words.sort(key=len, reverse=False)
    while len(words) > 1:
        currentWord = words.pop(0)
        if isSubstring(currentWord, words):
            print("%s is a substring of some other string" % currentWord)
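Sort all the strings by length and, for each word, compare it only to the longer words.
But this method has a flaw: words are still compared to words of the same length that the sort happens to place after them.
So I changed the checkForSubstring method: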
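def checkForSubstring(words):
    # Sort by length so that slicing off one length group leaves only longer words
    words = sorted(words, key=len)
    lengths = sorted(set(len(w) for w in words))
    sameLengthWordsLists = [[w for w in words if len(w) == num] for num in lengths]
    for wordList in sameLengthWordsLists:
        # Drop the current length group; what remains is strictly longer
        words = words[len(wordList):]
        if len(words) == 0:
            break
        for currentWord in wordList:
            if isSubstring(currentWord, words):
                print("%s is a substring of some other string" % currentWord)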
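Instead of only sorting by length, this version splits the list of strings into several lists by string length, then checks each list against the list of longer words. This solves the earlier problem.
But it is still not fast. Can anyone suggest a faster approach? At the moment this is a bottleneck.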
Answer 0 (score: 1)
Following up on my comment, something like this:
from collections import defaultdict

def checkForSubstrings(words):
    # e.g. fo: [foo, foobar]
    super_strings = defaultdict(list)
    # e.g. foo: [fo, oo]
    substrings = defaultdict(list)
    # Longest first, so that pop() takes the shortest remaining word
    words.sort(key=len, reverse=True)
    while words:
        # Note: pop(0) is highly inefficient, as it shifts the whole list
        word = words.pop()
        subwords = substrings[word]
        # Find the smallest list of words that contain a substring of `word`
        # (default=None covers the case where `word` has no known substrings)
        current_words = min((super_strings[w] for w in subwords), key=len, default=None)
        if not current_words:
            current_words = words
        super_words = [w for w in current_words if len(w) > len(word) and w.find(word) > -1]
        for s in super_words:
            substrings[s].append(word)
        super_strings[word] = super_words
    # The result is in super_strings
    return super_strings
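If no two words are substrings of one another, or if all of them are, this changes nothing. But if only some of them are, it should speed things up a bit. That, and using pop() instead of pop(0).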
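To illustrate the pop() versus pop(0) remark: list.pop(0) has to shift every remaining element, while pop() from the end is constant time. As an alternative to sorting in reverse, here is a minimal sketch that uses collections.deque to take words from the front cheaply. The name shortest_first is hypothetical and not part of the answer above.

```python
from collections import deque

def shortest_first(words):
    """Yield words from shortest to longest without calling list.pop(0)."""
    queue = deque(sorted(words, key=len))
    while queue:
        # deque.popleft() is O(1); list.pop(0) is O(n) because it shifts the list
        yield queue.popleft()

order = list(shortest_first(["foobar", "fo", "foo"]))
print(order)  # ['fo', 'foo', 'foobar']
```

Either approach (reverse sort plus pop(), or a deque) avoids the quadratic cost of repeatedly removing the first element of a long list.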
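Answer 1 (score: 0)
If the LARGE list of strings is not that large, you can build a HUGE dict that contains every possible contiguous substring. With that index in place, each subsequent lookup is a single hash lookup, O(1) on average, which may speed things up at the cost of memory.
Here is my sample code: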
# -*- coding: utf-8 -*-
import sys
from collections import defaultdict
text = """Sort all the strings by length, for each word, compare it only to the longer words.
But this method has a flaw in that words are still being compared to words of the same length which are arbitrarily placed after it during the list sort.
So I changed the "checkForSubstring" method:"""
def checkForSubstrings(words):
    # Build a big dict first; this may be a little slow and consume a lot of memory
    d = defaultdict(set)
    for windex, word in enumerate(words):
        # Get all non-empty contiguous substrings of word
        for i in range(len(word)):
            for j in range(i, len(word)):
                # Put (word_index, matches_whole) into our dict
                d[word[i:j+1]].add((windex, word[i:j+1] == word))
    # You may call sys.getsizeof(d) to check memory usage
    # (it reports only the dict's own size, not its keys and values):
    # print(sys.getsizeof(d))
    # Iterate over words, find matches but ignore the word itself
    for windex, word in enumerate(words):
        matches = d.get(word, [])
        for obj in matches:
            if not obj[1]:
                print("%s is a substring of some other string" % word)
                break

if __name__ == '__main__':
    words = text.lower().split()
    checkForSubstrings(words)
Output of this script:
sort is a substring of some other string
for is a substring of some other string
compare is a substring of some other string
it is a substring of some other string
method is a substring of some other string
a is a substring of some other string
in is a substring of some other string
words is a substring of some other string
are is a substring of some other string
words is a substring of some other string
length is a substring of some other string
are is a substring of some other string
it is a substring of some other string
so is a substring of some other string
i is a substring of some other string
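As a rough way to judge whether such an index is feasible, note that a word of length n has n(n+1)/2 non-empty contiguous substrings, so the index grows quadratically with word length. The sketch below is a simplified variant of the idea, not the code above: it stores only proper substrings in a set and returns the matches instead of printing them. The names proper_substrings, find_contained_words and index_upper_bound are hypothetical.

```python
def proper_substrings(word):
    """Return every non-empty contiguous substring that is shorter than `word`."""
    n = len(word)
    return {word[i:j] for i in range(n) for j in range(i + 1, n + 1) if j - i < n}

def find_contained_words(words):
    """Return the words that occur inside some strictly longer word of the list."""
    index = set()
    for word in words:
        index |= proper_substrings(word)
    # Each membership test is an average O(1) hash lookup
    return [w for w in words if w in index]

def index_upper_bound(words):
    """Upper bound on index entries: n * (n + 1) / 2 substrings per word of length n."""
    return sum(len(w) * (len(w) + 1) // 2 for w in words)

sample = ["foo", "foobar", "bar", "baz"]
print(find_contained_words(sample))  # ['foo', 'bar']
print(index_upper_bound(sample))     # 6 + 21 + 6 + 6 = 39
```

For a list of short words this bound stays small, but for long strings it quickly exceeds available memory, which is the caveat behind "not that large".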