How to find possible English words in a long random string?

Asked: 2013-10-12 19:10:43

Tags: python dictionary information-retrieval trie

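I'm working on an art project, and I want to see whether any information emerges from a very long string of characters (~28,000). It is a bit like the problem of solving a word jumble. Here is a snippet: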

  

jfifddcceaqaqbrcbdrstcaqaqbrcrisaxohvaefqiygjqotdimwczyiuzajrizbysuyuiathrevwdjxbinwajfgvlxvdpdckszkcyrlliqxsdpunnvmedjjjqrczrrmaaaipuzekpyqflmmymedvovsudctceccgexwndlgwaqregpqqfhgoesrsridfgnlhdwdbbwfmrrsmplmvhtmhdygmhgrjflfcdlolxdjzerqxubwepueywcamgtoifajiimqvychktrtsbabydqnmhcmjhddynrqkoaxeobzbltsuenewvjbstcooziubjpbldrslhmneirqlnpzdsxhyqvfxjcezoumpevmuwxeufdrrwhsmfirkwxfadceflmcmuccqerchkcwvvcbsxyxdownifaqrabyawevahiuxnvfbskivjbtylwjvzrnuxairpunskavvohwfblurcbpbrhapnoahhcqqwtqvmrxaxbpbnxgjmqiprsemraacqhhgjrwnwgcwcrghwvxmqxcqfpcdsrgfmwqvqntizmnvizeklvnngzhcoqgubqtsllvppnedpgtvyqcaicrajbmliasiayqeitcqtexcrtzacpxnbydkbnjpuofyfwuznkf

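What is the most efficient way to search this string for all possible English words embedded in it (both forwards and backwards)?

What is a useful dictionary to check substrings against? Is there a good library for this kind of thing? I have searched around and found some interesting trie solutions, but most of them deal with the case where you know the set of words in advance.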

3 Answers:

Answer 0: (score: 9)

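I used this solution to find all words in a corpus of 28,000 random characters, against a dictionary of 100,000 words, in 0.5 seconds. It runs in O(n) time in the length of the text, treating the length of the longest dictionary word as a constant. It requires a file called "words.txt", which is a dictionary whose words are separated by some kind of whitespace. I used the default unix wordlist in /usr/share/dict/words, but I'm sure you can find plenty of text-file dictionaries online if you don't have that one.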

from random import choice
import string

# words.txt: a dictionary file with words separated by any whitespace
with open('words.txt', 'r') as f:
    dictionary = set(f.read().lower().split())
max_len = max(map(len, dictionary)) #longest word in the set of words

text = ''.join(choice(string.ascii_lowercase) for i in range(28000))
text += '-'+text[::-1] #append the reverse of the text to itself

words_found = set() #set of words found, starts empty
for i in range(len(text)): #for each possible starting position in the corpus
    chunk = text[i:i+max_len] #chunk that is the size of the longest word
    for j in range(1,len(chunk)+1): #loop to check each possible subchunk
        word = chunk[:j] #subchunk
        if word in dictionary: #constant time hash lookup if it's in dictionary
            words_found.add(word) #add to set of words

print(words_found)
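
To see the scan at work without a words.txt file, here is a minimal sketch of the same idea wrapped in a function and run on a toy word set. It is written for Python 3. The names `find_words`, `toy_dictionary`, `sample` and the `min_len` parameter are illustrative assumptions, not part of the answer above; `min_len` is there because word lists often contain very short entries that match almost everywhere in random text.

```python
def find_words(text, dictionary, min_len=1):
    """Return every dictionary word that occurs in text, forwards or backwards."""
    max_len = max(map(len, dictionary))  # longest word bounds the inner loop
    # Scan the text and its reverse in one pass; '-' keeps matches from
    # spanning the join (assumes no dictionary word contains '-').
    both = text + '-' + text[::-1]
    found = set()
    for i in range(len(both)):
        # candidate substrings both[i:j] of length min_len..max_len
        for j in range(i + min_len, min(i + max_len, len(both)) + 1):
            if both[i:j] in dictionary:  # hash lookup in the set
                found.add(both[i:j])
    return found


# Hypothetical stand-in for words.txt
toy_dictionary = {"down", "baby", "new", "sun"}
sample = "xdownifaqbabydnusq"  # "sun" only appears reversed, as "nus"

print(sorted(find_words(sample, toy_dictionary)))  # ['baby', 'down', 'sun']
```

Raising `min_len` to 3 or 4 is one way to cut down the noise from one- and two-letter matches when running against a full word list.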

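Answer 1: (score: 1)

Here is a bisection (binary) search that should be useful. It checks whether a fragment is a prefix of any word in a sorted word list.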

def isaprefix(frag, wordlist, first, last):
    """
    Recursive binary search of wordlist for words that start with frag.

    assumes wordlist is a sorted list
    typically called with first = 0 and last = len(wordlist) - 1

    first,last -->> integer (inclusive bounds)
    returns bool
    """

    # base case - down to two elements or fewer
    if (last - first) < 2:
        # return False unless frag is a prefix
        # of one of the remaining words (the slice is empty if last < first)
        return any(w.startswith(frag) for w in wordlist[first:last + 1])

    # integer division so mid is a valid index
    mid = (first + last) // 2
    midword = wordlist[mid]

    # go ahead and return if you find one
    # a second base case?
    if midword.startswith(frag):
        return True

    # start the tests
    # python does just fine comparing strings
    if frag < midword:
        # set the limits to the lower half
        # of the previous range searched and recurse
        return isaprefix(frag, wordlist, first, mid - 1)

    # frag is > midword: set the limits to the upper half
    # of the previous range searched and recurse
    return isaprefix(frag, wordlist, mid + 1, last)
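
One way a prefix test like this might be used is to cut the scan short: at each starting position, keep extending the candidate only while some word still starts with it. The sketch below is an assumption about how that could look, not the answerer's code. It swaps the recursive search for the standard library's `bisect.bisect_left`, and the names `has_prefix` and `find_words_pruned` are hypothetical.

```python
from bisect import bisect_left


def has_prefix(frag, sorted_words):
    """True if some word in sorted_words starts with frag."""
    # The first word >= frag is the only candidate we need to look at:
    # if any word starts with frag, the smallest such word sits here.
    i = bisect_left(sorted_words, frag)
    return i < len(sorted_words) and sorted_words[i].startswith(frag)


def find_words_pruned(text, words):
    """Find dictionary words in text, stopping early on dead-end prefixes."""
    sorted_words = sorted(words)
    word_set = set(sorted_words)
    found = set()
    for start in range(len(text)):
        end = start + 1
        # extend the candidate while it is still a prefix of some word
        while end <= len(text) and has_prefix(text[start:end], sorted_words):
            if text[start:end] in word_set:
                found.add(text[start:end])
            end += 1
    return found


print(sorted(find_words_pruned("xdownifbabyq", ["down", "baby", "own", "if"])))
```

To search backwards as well, the same function can be called on `text[::-1]`. In random text most candidates die after two or three characters, so the inner loop usually ends long before it reaches the length of the longest word.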

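Answer 2: (score: 0)

You could consider building a single sequence from the whole dictionary and then aligning it against your string, using Smith-Waterman or any heuristic local alignment algorithm, to get the words in the sequence.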
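
For reference, a minimal sketch of Smith-Waterman local alignment scoring between two strings. Unlike an exact dictionary lookup, it also rewards near matches. The function name and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for illustration, and aligning each word separately like this is far slower than a hash lookup.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, end_in_a, end_in_b) of the best local alignment.

    The end positions are 1-based; only two rows of the score matrix are kept.
    """
    prev = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # extend the diagonal with a match or a mismatch
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best[0]:
                best = (curr[j], i, j)
        prev = curr
    return best


print(smith_waterman("xxdownxx", "down"))  # (8, 6, 4): exact match ending at position 6
print(smith_waterman("xxdxwnxx", "down"))  # score 5: one mismatch tolerated
```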