Python - Which word can have the most consecutive letters removed and still be a dictionary-valid word?

Time: 2011-05-21 22:18:21

Tags: python algorithm performance word

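I used this horrible and inefficient implementation to find the word that can have the most consecutive trailing letters removed and still be a word.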

Rodeo, for example, is a well-known one: Rodeo, Rode, Rod, Ro. The program found "composers": composers, composer, compose, compos, comp

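I was wondering how I should go about creating a program that finds the longest word that can have ANY of its letters (not just the last ones) removed and still be considered a word: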

For example: beast, best, bet, be - would be a valid possibility

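Here is my program that finds the word with the most consecutive letters removed (I'm also interested in hearing how it can be improved and optimized):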

#Recursive function that finds how many letters can be removed from a word and
#it still be valid.  
def wordCheck(word, wordList, counter):

    if len(word)>=1:
        if word in wordList:
            return (wordCheck(word[0:counter-1], wordList, counter-1))
        else:
            return counter
    return counter


def main():
    a = open('C:\\Python32\\megalist2.txt', 'r+')
    wordList = set([line.strip() for line in a])
    #megaList contains a sorted list of tuple of 
    #(the word, how many letters can be removed  consecutively)
    megaList = sorted([(i, len(i)-1- wordCheck(i, wordList, len(i))) for i in wordList], key= lambda megaList: megaList[1])


    for i in megaList:
        if i[1] > 3:
            print (i)

if __name__ == '__main__':
    main()

3 answers:

Answer 0 (score: 10)

for each english word W:
    for each letter you can remove:
        remove the letter
        if the result W' is also word:
            draw a line W->W'
for each english word W:
    connect ROOT-> each english word W
use a graph search algorithm to find the longest path starting at ROOT
    (specifically, the words are now in a directed acyclic graph; traverse
    the graph left-right-top-down, i.e. in a "topological sort", keeping
    track of the longest candidate path to reach each node; this achieves 
    the result in linear time)

This algorithm takes only linear O(#wordsInEnglish * averageWordLength) time! Basically as long as it takes to read the input.
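The pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than the answerer's own code: the function name `longest_removal_chain` and the sample word list are hypothetical, and processing words shortest-first stands in for the topological sort, which works because every edge points to a strictly shorter word.

```python
def longest_removal_chain(word_list):
    """Return the longest chain in which each step deletes one letter
    and the result is still in word_list (assumes a non-empty list)."""
    words = set(word_list)
    # Edges W -> W': W' is W with exactly one letter deleted and is also a word.
    edges = {
        w: {w[:i] + w[i + 1:] for i in range(len(w))} & words
        for w in words
    }
    # Shortest-first is a valid processing order: every edge points to a
    # strictly shorter word, so all successors of w are handled before w.
    best_len = {}   # word -> number of words in the best chain starting there
    best_next = {}  # word -> next word on that chain (None at the end)
    for w in sorted(words, key=len):
        best_len[w], best_next[w] = 1, None
        for shorter in edges[w]:
            if best_len[shorter] + 1 > best_len[w]:
                best_len[w], best_next[w] = best_len[shorter] + 1, shorter
    # ROOT connects to every word, so the answer starts at the best word overall.
    current = max(words, key=lambda w: best_len[w])
    chain = []
    while current is not None:
        chain.append(current)
        current = best_next[current]
    return chain


# Hypothetical sample data; a real run would load a dictionary file instead.
sample = ["beast", "best", "bet", "be", "rodeo", "rode", "rod"]
print(longest_removal_chain(sample))  # ['beast', 'best', 'bet', 'be']
```

Keeping a `best_next` pointer per word is what lets the whole chain be printed at the end instead of only its length.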

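It can easily be modified to find consecutive letters removed: rather than keeping a single candidate per node (Node('rod').candidate = rodeo->rode->rod), keep a family of candidates per node, along with the index of the letter you removed to get each candidate (Node('rod').candidate = {rodeo->rod|e->rod|road->ro|d}). This has the same running time.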

Answer 1 (score: 8)

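Here's an implementation I just wrote up. It runs in about five seconds with my ~235k word list. The output doesn't show the whole chain, but you can easily reassemble it from the output.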

# Load the words into a dictionary (word -> set of shorter words, filled in below)
with open("/usr/share/dict/words") as f:
    words = dict((x.strip(), set()) for x in f)

# For each word, remove each letter and see if the remaining word is still
# in the dictionary. If so, add it to the set of shorter words associated with
# that word in the dictionary.
# For example, bear -> {ear, bar, ber}
for w in words:
    for i in range(len(w)):
        shorter = w[:i] + w[i+1:]
        if shorter in words:
            words[w].add(shorter)

# Sort the words by length so we process the shortest ones first
sortedwords = sorted(words, key=len)

# For each word, the maximum chain length is:
#  - 1 + the maximum of the chain lengths of each shorter word, if any
#  - or 0 if there are no shorter words for this word
# Note that because sortedwords is sorted by length, we will always
# have maxlength[x] already available for each shorter word x
maxlength = {}
for w in sortedwords:
    if words[w]:
        maxlength[w] = 1 + max(maxlength[x] for x in words[w])
    else:
        maxlength[w] = 0

# Print the words in all chains for each of the top 10 words
toshow = sorted(words, key=lambda x: maxlength[x], reverse=True)[:10]
while toshow:
    w = toshow[0]
    print(w, [(x, maxlength[x]) for x in words[w]])
    toshow = toshow[1:] + list(x for x in words[w] if x not in toshow)

The longest word chain in my dictionary is:

  • abranchiate
  • branchiate
  • branchi
  • branch
  • ranch
  • rach
  • ach
  • a
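
The answer notes that the full chain can be reassembled from the output. One possible way to do that directly is sketched below, assuming dictionaries shaped like `words` (word -> set of shorter words) and `maxlength` from the code above. The helper name `rebuild_chain` and the toy data are hypothetical and only there to make the example self-contained.

```python
def rebuild_chain(word, shorter_words, maxlength):
    """Follow the best shorter word at each step until none is left."""
    chain = [word]
    while shorter_words[word]:
        # Any shorter word with the largest maxlength continues a longest chain.
        word = max(shorter_words[word], key=lambda x: maxlength[x])
        chain.append(word)
    return chain


# Toy stand-ins for the `words` and `maxlength` dictionaries (made-up data).
toy_words = {
    "bear": {"ear", "bar"},
    "ear": set(),
    "bar": {"ba"},
    "ba": set(),
}
toy_maxlength = {"bear": 2, "ear": 0, "bar": 1, "ba": 0}
print(rebuild_chain("bear", toy_words, toy_maxlength))  # ['bear', 'bar', 'ba']
```

When several shorter words tie for the largest `maxlength`, this returns just one of the equally long chains.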

Answer 2 (score: 1)

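Maybe I'm just missing the point of the exercise, but shouldn't a simple heuristic rule be able to cut down a lot of the search? Especially if you are just trying to find a single word that can have the most letters cut, you'd probably want to look only at the largest words and check whether they contain any of the smallest ones.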

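For example, a huge number of words include the letters "a" and "i", which are both valid English words. Moreover, longer words are increasingly likely to contain one or both letters. You can probably skip any word that doesn't have an "a" or "i".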
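
As a small illustration of that filter (a sketch only; the helper name `has_one_letter_word` and the sample words are made up for the example):

```python
def has_one_letter_word(word, one_letter_words=("a", "i")):
    """True if the word contains a letter that is itself a one-letter word."""
    return any(letter in word for letter in one_letter_words)


sample = ["beast", "rodeo", "string", "myth", "best"]
print([w for w in sample if has_one_letter_word(w)])  # ['beast', 'string']
```

Note that the filter only keeps words whose chain could end in a one-letter word. A chain that stops at a two-letter word such as "be" would be skipped, so a later pass over the remaining words is still needed.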

You could probably work this into Greg's solution, actually, if you get a sorted copy of the word list first, i.e.:

# Similar to Greg's.  Reads into a dict
words = dict((x.strip(), None) for x in open("/usr/share/dict/words"))
# Generates a reverse sorted list from the dict (largest words first)
sortedwords = sorted(words, key=len, reverse=True)

# Largest possible reduction is making a longest word into 1 letter
longestPossible = len(sortedwords[0])-1

# Function that recursively finds shorter words and keeps count of reductions
def getMaxLettersRemoved(w, words, alreadyRemovedCount=0):
    # Only compute the value for w if it has not been cached yet
    if words[w] is None:
        # Every string obtained by deleting exactly one letter from w
        shorterPossibilities = set(w[:i] + w[i+1:] for i in range(len(w)))
        # Each valid shorter word costs one removal plus whatever it allows itself
        removals = [getMaxLettersRemoved(shorter, words, 1) for shorter in shorterPossibilities if shorter in words]
        # Cache the count for w alone (0 if no single deletion yields a word),
        # so the cached value does not depend on the caller
        words[w] = max(removals) if removals else 0
    # Add the letters already removed on the way down to w
    return alreadyRemovedCount + words[w]

# Go through words from largest to smallest, so long as they have 'a' or 'i'
bestNumRemoved = 0
for w in sortedwords:
    if 'a' in w or 'i' in w:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved 
        # Stop if you can't do better
        if bestNumRemoved >= len(w)-1:
            break

# Need to make sure the best one found is better than any left
if bestNumRemoved < longestPossible:
    for w in sortedwords:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved 
        # Stop if you can't do better
        if bestNumRemoved >= len(w)-2:
            break

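So this one differs in a few ways. Firstly, it sorts first so that you get the largest words first. Secondly, it completely ignores any word not containing an 'a' or an 'i' on the first pass. Thirdly, it doesn't need to compute every word or the entire tree in order to produce a result. Instead, it just does a recursive look through the words as they're needed.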

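Every time it cuts out a letter and finds a new word, it runs the same function to find the number of letters it can cut out of that smaller word, plus the number already removed from whatever root word it came from. This should in theory be fairly fast because it won't need to run on most words, since it does a typical optimization trick: checking whether it's at an optimal bound. First, it finds the best possibility among the words containing 'i' or 'a'. Then, it checks words longer than the best one found to make sure there isn't a better option that contains neither letter but is at least 2 letters longer (so it could theoretically be better).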

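There are probably some improvements that could do an even better job of exploiting the regularities of English with a probabilistic algorithm, but I suspect this is all right as a deterministic one. Additionally, I don't have a dictionary on hand, so I can't actually uhm... run this, but the concepts are sound.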

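Additionally, I'm not entirely convinced that sorting the list of keys is worth it. While the Python sort algorithm works quite fast, it's still dealing with a big list and could have a pretty significant cost. An ideal algorithm would probably have to take that cost into account and decide whether it's worth it (probably not). If you don't sort the list, you'd probably want the first pass to consider only words of a certain minimum length - maybe even as part of a larger loop. There's really no sense in calculating anything for words that might have nothing to do with the solution.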