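A better algorithm for exercise 9.3 in "Think Python: How to Think Like a Computer Scientist"

Time: 2014-03-19 12:42:55

Tags: python algorithm

Exercise 9.3 in this book asks the reader to find the combination of 5 forbidden letters that excludes the smallest number of words in this file.

Here is my solution to the first part, which I believe is correct: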

# return True if the word contains any of the given letters,
# otherwise return False
def contain(word, letters):
    for letter in letters:
        if letter in word:
            return True
    return False

# return the number of words that contain any of the given letters
def ncont(words, letters):
    count = 0
    for word in words:
        if contain(word, letters):
            count += 1
    return count

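For the actual question, though, I can only think of a brute-force algorithm that tries every possible combination. To be exact, there are 26! / (5! * 21!) = 65780 combinations. Here is the implementation: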

def get_lset(nlt, alphabet, cur_set):
    global min_n, min_set
    # once enough letters have been chosen, score this combination
    if nlt <= 0:
        cur_n = ncont(words, ''.join(cur_set))
        if min_n == -1 or cur_n < min_n:
            min_n = cur_n
            min_set = cur_set.copy()
        print(''.join(cur_set), cur_n, ' *->', min_n, ''.join(min_set))
    # otherwise choose the next letter recursively
    else:
        cur_set.append(None)
        for i in range(len(alphabet)):
            cur_set[-1] = alphabet[i]
            get_lset(nlt-1, alphabet[i+1:], cur_set)
        cur_set.pop()

Then I call the function above like this:

import string

if __name__ == '__main__':
    min_n = -1
    min_set = []
    with open('words.txt', 'r') as fin:
        words = [line.strip() for line in fin]
    get_lset(5, list(string.ascii_lowercase), [])
    print(min_set, min_n)

But this solution is very slow. Is there a better algorithm for this problem? Any suggestions would be appreciated!
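As a quick sanity check on the size of that search space, here is a minimal sketch (assuming Python 3.8 or later for math.comb) that computes the number of 5-letter combinations and cross-checks it by enumeration:

```python
import math
import string
from itertools import combinations

# number of ways to choose 5 letters out of 26: 26! / (5! * 21!)
n_combinations = math.comb(26, 5)

# cross-check by actually enumerating every combination
n_enumerated = sum(1 for _ in combinations(string.ascii_lowercase, 5))

print(n_combinations, n_enumerated)
```

Both numbers should agree, which confirms that the brute-force search calls ncont once per combination over the whole word list.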

3 answers:

Answer 0: (score: 3)

First, let's rewrite it more concisely:

def contain(word, letters):
    return any(letter in word for letter in letters)

def ncont(words, letters):
    return sum(contain(word, letters) for word in words)

Currently your algorithm has an average complexity of

O(len(letters) * len(a_word) * len(words))
  ---+----------------------   -+--------
     contain(word, letters)     ncont(words, letters)

We can reduce this by using sets:

def contain(word, letters):
    return not set(letters).isdisjoint(set(word))

which brings it down to:

O(min(len(letters), len(a_word)) * len(words))
  ---+--------------------------   -+--------
     contain(word, letters)        ncont(words, letters)

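according to https://wiki.python.org/moin/TimeComplexity (this bound covers the isdisjoint call itself; building the two sets adds time linear in their lengths).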
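A small sketch to check that the set-based version behaves like the loop-based one. The names contain_loop and contain_set are hypothetical renames used only so both versions can sit side by side, and the sample words are made up:

```python
def contain_loop(word, letters):
    # original idea: test each letter against the word
    return any(letter in word for letter in letters)

def contain_set(word, letters):
    # set-based idea: the word matches if the two sets share a letter
    return not set(letters).isdisjoint(set(word))

samples = ["apple", "sky", "rhythm", ""]
for word in samples:
    for letters in ("aeiou", "xyz", ""):
        # both versions must agree on every pair
        assert contain_loop(word, letters) == contain_set(word, letters)

print(contain_set("apple", "aeiou"), contain_set("sky", "aeiou"))
```

If the same letters are tested against many words, it may also be worth building set(letters) once outside the loop rather than on every call.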


As for the second part, the algorithm is easier to understand with itertools:

import itertools
import string

def minimum_letter_set(words, n):
    attempts = itertools.combinations(string.ascii_lowercase, n)
    return min(attempts, key=lambda attempt: ncont(words, attempt))

However, we can do better:

def minimum_letter_set(words, n):
    # build a lookup table for each letter to the set of words it features in
    by_letter = {
        letter: {
            word
            for word in words
            if letter in word
        }
        for letter in string.ascii_lowercase
    }

    # allowing us to define a function that finds words that match multiple letters
    def matching_words(letters):
        return set.union(*(by_letter[l] for l in letters))

    # find all n-letter combinations
    attempts = itertools.combinations(string.ascii_lowercase, n)

    # and return the one that matches the fewest words
    return min(attempts, key=lambda a: len(matching_words(a)))

I don't believe this has a lower algorithmic complexity, but it certainly avoids the repeated work of filtering the word list.
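To illustrate that the per-letter lookup table gives the same counts as filtering the word list each time, here is a small sketch. The four-word list is made up for illustration, and the equivalence assumes the word list has no duplicate entries (a set would collapse them):

```python
import string

def ncont(words, letters):
    # brute-force count: scan every word for every query
    return sum(any(l in word for l in letters) for word in words)

words = ["apple", "banana", "cherry", "kiwi"]

# built once: letter -> set of words that contain it
by_letter = {l: {w for w in words if l in w} for l in string.ascii_lowercase}

def matching_words(letters):
    # union of the precomputed sets, no rescan of the word list
    return set().union(*(by_letter[l] for l in letters))

for letters in ("ab", "kz", "xyz", "q"):
    assert len(matching_words(letters)) == ncont(words, letters)

print(sorted(matching_words("ab")))
```

The word list is scanned 26 times up front, and each of the 65780 combinations then only costs a union of five sets.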

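Answer 1: (score: 0)

Here is my idea:

First compute excluded[l], which maps each letter l to the set of words excluded by that letter (that is, the words containing l).

Compute the union of the five smallest of these 26 sets. This gives you a reasonable "provisional minimum result".

Then, instead of using itertools.combinations to explore all 5-letter combinations, write your own algorithm to do so, and compute the union of the "excluded" sets inside it. In this algorithm, if for the first i letters (i < 5) the "excluded" set is already larger than the "provisional minimum result", you don't need to consider any further letters at all. Whenever you find a five-letter combination that beats the current "provisional minimum result", update it.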
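The idea above can be sketched roughly as follows. This is one possible reading of it, not a definitive implementation: the function name best_forbidden_letters, the alphabet parameter, and the demo word list are all assumptions made for illustration, and the sketch also prunes ties (>= rather than >) since a tie cannot improve the result.

```python
import string

def best_forbidden_letters(words, n=5, alphabet=string.ascii_lowercase):
    # excluded[l]: the set of words that forbidding letter l would exclude
    excluded = {l: {w for w in words if l in w} for l in alphabet}
    # try letters that exclude few words first
    letters = sorted(alphabet, key=lambda l: len(excluded[l]))

    # provisional minimum: union of the n smallest sets
    best_letters = tuple(letters[:n])
    best_count = len(set().union(*(excluded[l] for l in best_letters)))

    def search(start, chosen, union):
        nonlocal best_letters, best_count
        if len(chosen) == n:
            # only reached when union is strictly smaller than best_count
            best_letters, best_count = tuple(chosen), len(union)
            return
        remaining = n - len(chosen)
        for i in range(start, len(letters) - remaining + 1):
            new_union = union | excluded[letters[i]]
            # adding letters can only grow the union, so this branch
            # can never beat the provisional minimum: skip it
            if len(new_union) >= best_count:
                continue
            chosen.append(letters[i])
            search(i + 1, chosen, new_union)
            chosen.pop()

    search(0, [], set())
    return "".join(sorted(best_letters)), best_count

# tiny made-up example with a 6-letter alphabet and n=2
demo_words = ["ab", "bc", "cd", "de", "ace", "bdf", "fa", "ef", "cab"]
print(best_forbidden_letters(demo_words, n=2, alphabet="abcdef"))
```

How much this saves depends on how good the provisional minimum is; in the worst case it still visits every combination.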

Answer 2: (score: 0)

Here is my solution:

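    from string import ascii_lowercase

    def smallest_set(filename):
        # for each letter, count the words that do NOT contain it
        avoid_dict = dict.fromkeys(ascii_lowercase, 0)
        with open(filename) as file_handler:
            for line in file_handler:
                for key in avoid_dict:
                    if key not in line:
                        avoid_dict[key] += 1
        # letters missing from the most words come first; each letter is
        # ranked on its own, so overlap between letters is not considered
        avoid_stats_sorted = sorted(avoid_dict, key=avoid_dict.get,
                                    reverse=True)
        return ''.join(avoid_stats_sorted[:5])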