Question

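The following function returns the number of words in a list that contain exactly the same characters as the input word. The order of the characters within a word does not matter. However, the list contains millions of words. What is the most efficient and fastest way to perform this search?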

Example:

words_list = ['yek','lion','eky','ekky','kkey','opt'];

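If we match the word "key" against the words in the list, the function counts only "yek" and "eky", because they share exactly the same characters as "key", regardless of order.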

Here is the function I wrote:

from itertools import permutations

def find_a4(words_list, word):
    # all possible permutations of the word that we are looking for,
    # stored as a set of strings
    word_permutations = set(''.join(p) for p in permutations(word))
    word_size = len(word)
    count = 0

    for candidate in words_list:
        # in the case of the word "key",
        # we only accept words that have 3 characters
        # and that are in word_permutations
        if len(candidate) == word_size and candidate in word_permutations:
            count += 1

    return count

Answer 1

Build a dictionary whose keys are the sorted versions of the words:

word_list = ['yek','lion','eky','ekky','kkey','opt']

from collections import defaultdict
word_index = defaultdict(set)

for word in word_list:
    idx = tuple(sorted(word))
    word_index[idx].add(word)

# word_index = {
#    ('e', 'k', 'y'): {'yek', 'eky'},
#    ('i', 'l', 'n', 'o'): {'lion'},
#    ('e', 'k', 'k', 'y'): {'kkey', 'ekky'},
#    ('o', 'p', 't'): {'opt'}
# }

Then, to query, you would do this:

def find_a4(word_index, word):
    idx = tuple(sorted(word))
    return len(word_index[idx])

Alternatively, if you need to return the actual words, change it to return word_index[idx].

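Efficiency: a query runs in O(1) time on average with respect to the size of the list, plus O(k log k) to sort a query word of length k. The index is built once, in a single pass over the list.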
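As a minimal, self-contained sketch of the index-then-query idea above: the helper names build_index and count_anagrams are hypothetical (they are not from the answer), and the lookup uses .get() as an assumption about what is wanted on a miss, so that querying an unknown word does not add an empty entry to the defaultdict.

```python
from collections import defaultdict


def build_index(words):
    """Group words by their sorted-character key (one pass over the list)."""
    index = defaultdict(set)
    for w in words:
        index[tuple(sorted(w))].add(w)
    return index


def count_anagrams(index, word):
    """Count indexed words that use exactly the same characters as `word`."""
    # .get() avoids inserting an empty set into the defaultdict on a miss.
    return len(index.get(tuple(sorted(word)), ()))


word_list = ['yek', 'lion', 'eky', 'ekky', 'kkey', 'opt']
index = build_index(word_list)
print(count_anagrams(index, 'key'))    # 2
print(count_anagrams(index, 'zebra'))  # 0
```

Because each key maps to a set, a word that appears several times in the list is counted once. If duplicates should be counted, storing a list (or a running count) per key would be one way to do it.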

Answer 2

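For long strings, you would have n! permutations to search through. I would instead sort the strings before comparing, which costs O(n log n) per string, and only sort and compare when the lengths match: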

def find_a4(words_list, word):
    word = ''.join(sorted(word))
    word_size = len(word)
    count = 0
    for word1 in words_list:
        if len(word1) == word_size:
            if word == ''.join(sorted(word1)):
                count += 1
    return count
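A variation on the scan above, not taken from either answer: instead of sorting each candidate, compare character counts with collections.Counter, which is linear in the word length. The function name count_matches is hypothetical, and whether this is actually faster than sorting for short words would need to be measured.

```python
from collections import Counter


def count_matches(words_list, word):
    """Linear scan that compares character counts instead of sorted strings."""
    target = Counter(word)
    size = len(word)
    count = 0
    for candidate in words_list:
        # Cheap length check first; only then build the Counter.
        if len(candidate) == size and Counter(candidate) == target:
            count += 1
    return count


words_list = ['yek', 'lion', 'eky', 'ekky', 'kkey', 'opt']
print(count_matches(words_list, 'key'))  # 2
```

Like the sorted-string scan, this still visits every word on each query, so for repeated queries against the same list a prebuilt index is likely the better fit.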
