Question

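I have a huge dictionary of English words, and I'm trying to get all the words that contain the same letters. For example, given aplep I want it to return apple, and given applej it should also return apple. I tried generating every permutation of the input word, but that becomes impractical for long words. Does anyone have any ideas?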

Edit: the dictionary is a txt file with one word per line.

Thanks.

Answer 1

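You can count the letters in each word and check whether the search word's letter counts are a subset of the dictionary word's, like this: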

from collections import Counter

def subset(c1, c2):
    # True if every letter in c1 occurs at least as often in c2
    for c, count in c1.items():
        if count > c2[c]:
            return False
    return True

words = ['apple', 'pear', 'orange', 'applej', 'appppppplllllleeee', 'aple']
find_word = Counter('aplep')

for word in words:
    if subset(find_word, Counter(word)):
        print(word)

This displays three matches:

apple
applej
appppppplllllleeee

To read your word list from a file named words.txt, assuming each word is on its own line:

# Uses Counter and subset() from the snippet above
with open('words.txt') as f_input:
    words = f_input.read().splitlines()

find_word = Counter('aplep')

for word in words:
    if subset(find_word, Counter(word)):
        print(word)
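The question's applej example reads as if the match should go the other way: the dictionary word has to be spellable from the given letters. If that is what you need, here is a minimal sketch. The helper name can_build is made up for illustration; it relies on Counter subtraction discarding zero and negative counts.

```python
from collections import Counter

def can_build(word, letters):
    """True if `word` can be spelled using only the letters in `letters`."""
    # Counter subtraction keeps only positive counts, so an empty
    # result means `letters` covers every letter of `word`.
    return not (Counter(word) - Counter(letters))

words = ['apple', 'pear', 'orange', 'aple']
matches = [w for w in words if can_build(w, 'applej')]
print(matches)  # ['apple', 'aple']
```

Note that shorter words such as aple also match here, since they use a subset of the given letters.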

Answer 2

Building on @Jean-François Fabre's answer.

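You can store the sorted words in a kind of prefix tree: a data structure whose leaves hold the sorted words, and where the path to each word runs through its progressively longer prefixes. For example, if the dictionary contains 'abc' and 'abd', the structure looks like this: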

    a
     \
      ab
     /  \
  abc    abd

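If you want all the words containing 'ab', walk the tree down to 'ab' and use every node below it as a key into a dictionary that maps sorted keys back to the original, unsorted words.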
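A rough sketch of that idea, assuming a nested-dict trie and a plain dictionary from sorted key to original words. The function names build_index, build_trie and keys_with_prefix are made up for illustration:

```python
from collections import defaultdict

def build_index(words):
    """Map each sorted-letter key to the original words that share it."""
    index = defaultdict(list)
    for word in words:
        index["".join(sorted(word))].append(word)
    return index

def build_trie(keys):
    """Nested-dict trie over the sorted keys; a None entry marks a complete key."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = key
    return root

def keys_with_prefix(trie, prefix):
    """Return every stored key that starts with `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for ch, child in current.items():
            if ch is None:
                found.append(child)   # complete key stored at this node
            else:
                stack.append(child)   # keep walking down the subtree
    return found

words = ["cab", "bad", "abc", "dog"]
index = build_index(words)
trie = build_trie(index)
matches = sorted(w for key in keys_with_prefix(trie, "ab") for w in index[key])
print(matches)  # ['abc', 'bad', 'cab']
```

Keep in mind that a plain prefix lookup only finds keys that begin with the sorted query. A query such as 'bc' would not find 'abc', so a full "contains these letters" search needs a traversal that can skip over extra letters.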

Answer 3

Something like this? Use a set instead of generating all the permutations:

given_word = "apple"
list_of_all_words_in_dictionary = ["applepie", "anapple"]

# Compare distinct letters only; repeated letters are ignored
given_letters = set(given_word)
for word in list_of_all_words_in_dictionary:
    if given_letters.issubset(set(word)):
        print(word)  # do something with the match

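The limitation of this idea is that even a word like "aple" passes the test. If you only want "apple"/"alepp" to pass and not "aple", then don't use set() to get the word's character set. Instead, use a custom function that first counts how many times each character occurs: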

from collections import defaultdict as dd

def count_char(word):
    # defaultdict needs a factory such as int, not the word itself
    word_dict = dd(int)
    for char in word:
        word_dict[char] += 1
    return word_dict.items()
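One way the counts could then be compared, as a sketch only. Here count_char returns the dictionary itself rather than its items, and has_all_letters is a made-up helper name:

```python
from collections import defaultdict

def count_char(word):
    """Count how many times each character occurs in `word`."""
    counts = defaultdict(int)
    for char in word:
        counts[char] += 1
    return counts

def has_all_letters(given, candidate):
    """True if `candidate` has at least as many of each letter as `given`."""
    need = count_char(given)
    have = count_char(candidate)
    # .get avoids inserting missing letters into the defaultdict
    return all(have.get(ch, 0) >= n for ch, n in need.items())

print(has_all_letters("apple", "applepie"))  # True
print(has_all_letters("apple", "aple"))      # False
```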

Answer 4

Read through your dictionary and, for each word word1, read through the dictionary again and for each word word2 check:

# strip() removes every leading/trailing character found in word2
if word1.strip(word2) == '':
  print(word1 + " contains only letters from " + word2)
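Spelled out as a complete loop, this might look like the sketch below. The names only_uses_letters_of and find_pairs are made up for illustration. The check ignores how many times a letter occurs, and the nested loop is quadratic in the size of the dictionary, so it may be slow on a huge word list.

```python
def only_uses_letters_of(word1, word2):
    """True if every character of word1 also appears somewhere in word2."""
    # str.strip(chars) removes leading/trailing characters that are in
    # `chars`; nothing is left only if all of word1's characters are in word2.
    return word1.strip(word2) == ''

def find_pairs(words):
    """Compare every word against every other word (quadratic)."""
    return [(w1, w2)
            for w1 in words
            for w2 in words
            if w1 != w2 and only_uses_letters_of(w1, w2)]

pairs = find_pairs(['aple', 'apple', 'pear'])
print(pairs)  # [('aple', 'apple'), ('apple', 'aple')]
```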

Find all the letters of a word in a text file
