Question

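I have a file containing over 100,000 words. What I need to do is run through every 5-letter combination of the alphabet to work out the 5 letters that are used by the fewest words.

I've written a Python program that will eventually get the answer, but at the speed it is going it would probably take around 48 hours, if not longer. Part of the problem is the sheer number of calculations. I haven't worked out how to limit the permutations so that only distinct strings are compared - so that is about 26⁵ calculations on the combinations alone, and each of those is then compared against 100,000 words, which works out at roughly 10 * 10¹¹ calculations at a bare minimum.

Is there any way to speed this process up substantially, whether through a more efficient algorithm, multi-threading, or something like that?

Any recommendations for books/articles on algorithm efficiency would also be greatly appreciated.

My current program is as follows:

Imports the permutations function from the itertools module: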

from itertools import permutations

Asks whether the word avoids the forbidden letters:

def avoids(word, letters):
    for characters in letters:
        for character in word:
            if character == characters:
                return False
    return True

Counts the number of words in the file that do not contain the forbidden characters:

def number_of_words(file, letters):

    x = 0 #counter
    #the with block closes the file once every line has been read
    with open(file) as open_file:
        for line in open_file:
            word = line.strip()
            if avoids(word, letters):
                x += 1
    return x

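Runs through every variation of five letters that exists in the alphabet and works out the combination that excludes the fewest words: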

def find_smallest():

    y = 0
    combination = None

    #every combination of letters
    for letters in permutations("abcdefghijklmnopqrstuvwxyz", 5):
        x = number_of_words("words.txt", letters)
        #sets y to the initial value of x
        if y == 0:
            y = x
            combination = letters
            print("Start point, combination: %s, amount: %d" % (letters, y))

        #If current combination is greater than all previous combinations set y to x
        elif x > y:
            y = x
            combination = letters
            print("New highest, combination: %s, amount: %d" % (letters, y))

    #report the best combination once the loop has finished
    print("%s excludes the smallest number of words (%d)" % (combination, y))

Runs the program:

find_smallest()

Answer 1

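You can use combinations instead of permutations.
Why not scan all the words just once, count the number of occurrences of each letter, and then pick the 5 letters with the fewest occurrences?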
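
To illustrate the first suggestion, here is a small sketch of how much work is saved when the order of the five letters is ignored. The variable names are only illustrative; the counts follow from the standard formulas for permutations and combinations of 26 letters taken 5 at a time.

```python
from itertools import combinations
from math import factorial

alphabet = "abcdefghijklmnopqrstuvwxyz"

# Ordered selections of 5 distinct letters: 26 * 25 * 24 * 23 * 22
n_permutations = factorial(26) // factorial(21)

# Unordered selections: every set of 5 letters appears 5! times among the permutations
n_combinations = n_permutations // factorial(5)

# Cross-check the formula by actually enumerating the combinations
n_enumerated = sum(1 for _ in combinations(alphabet, 5))

print(n_permutations, n_combinations, n_enumerated)
```

Since ('a', 'b', 'c', 'd', 'e') and ('e', 'd', 'c', 'b', 'a') forbid exactly the same letters, switching to combinations cuts the outer loop by a factor of 120 without changing the result.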

Answer 2

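This isn't really a question about improving the efficiency of permutations. It is a question about how to design a smarter algorithm; it is a data structure question.

"I have a file containing over 100,000 words. What I need to do is run through every 5-letter combination of the alphabet to work out the 5 letters that are used by the fewest words."

Loop through the 26 letters of the alphabet and count the number of words in your list that use each letter: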

import string
alphabet = string.ascii_lowercase
# words is assumed to be the list of words already read from the file
counts = {k: sum(k in word.lower() for word in words) for k in alphabet}

This should be pretty quick, and should give you enough information to trivially pick out the five least popular letters.
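
As a minimal sketch of that last step, the five least-used letters can be read off the counts by sorting. The function name least_used_letters and the sample word list are hypothetical, standing in for the real contents of words.txt.

```python
import string

def least_used_letters(words, n=5):
    """Return the n letters that appear in the fewest words."""
    # Same per-letter count as above: how many words contain each letter
    counts = {k: sum(k in word.lower() for word in words)
              for k in string.ascii_lowercase}
    # sorted() is stable, so letters with equal counts stay in alphabetical order
    return sorted(counts, key=counts.get)[:n]

# Hypothetical sample data; in practice this would be the lines of words.txt
sample_words = ["apple", "banana", "cherry", "quiz", "jazz"]
print(least_used_letters(sample_words))
```

With a real word list, ties are less likely, but the alphabetical tie-break keeps the output deterministic either way.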

An equivalent approach, probably more efficient but perhaps less clear than the above:

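import string
from collections import Counter
# Start every letter at zero so that unused letters still appear in the result
counter = Counter({k: 0 for k in string.ascii_lowercase})
# set() makes each word contribute at most once per letter
counter.update(Counter(c for w in words for c in set(w.lower())))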
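
One caveat with per-letter counts: a word containing two rare letters is counted once for each, so the five individually rarest letters are not guaranteed to be the exact five-letter set that excludes the fewest words. If an exact answer is wanted, one possible sketch (not something either answer proposes) encodes each word's letters as a 26-bit mask and checks all 65,780 combinations against the distinct masks. The names letter_mask and best_letters_to_forbid and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations
import string

def letter_mask(text):
    """Return an integer with one bit set per distinct a-z letter in text."""
    mask = 0
    for ch in text.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

def best_letters_to_forbid(words, k=5):
    """Return (letters, kept): the k letters whose exclusion keeps the most words."""
    # Many words share a letter set, so each distinct mask is stored once with its count
    mask_counts = Counter(letter_mask(w) for w in words)
    best_letters, best_kept = None, -1
    for combo in combinations(string.ascii_lowercase, k):
        forbidden = letter_mask("".join(combo))
        # A word is kept when it shares no letter with the forbidden set
        kept = sum(n for m, n in mask_counts.items() if m & forbidden == 0)
        if kept > best_kept:
            best_letters, best_kept = "".join(combo), kept
    return best_letters, best_kept

# Hypothetical sample data; in practice this would be the lines of words.txt
sample_words = ["apple", "banana", "cherry", "quiz", "jazz"]
print(best_letters_to_forbid(sample_words))
```

The file is read only once here, and the inner loop runs over distinct letter sets rather than over every word for every permutation. On ties, the first combination in alphabetical order is returned.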

How to improve the efficiency of a permutation algorithm with Python
