Suggested non-brute-force solution for searching for words in a string

Time: 2018-12-08 02:43:23

Tags: python python-3.x

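I was wondering if you could suggest an algorithm for a non-brute-force solution that walks through a string, splits it into two halves, and checks whether both halves exist in a dictionary/hash?

For example, the string "peanutbutter" would be split into "peanut" and "butter" (yes, there are other words in there, but for the purposes of the example we can go with these two).

Here is the brute-force solution I came up with: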

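# Sample word set, assumed here so the snippet runs on its own.
DICTIONARY = {"peanut", "butter"}


def break_into_spaces(S):
    # Try every split point and look both halves up in DICTIONARY.
    for i in range(1, len(S)):
        left = S[:i]   # prefix
        right = S[i:]  # suffix
        if left in DICTIONARY and right in DICTIONARY:
            print("Found them!")
        print("{} {}".format(left, right))  # show every split tried


break_into_spaces("peanutbutter")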

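2 answers:

Answer 0: (score: 1)

This is not a complete solution, but a good idea might be to store the words in a dictionary where, for example, the keys are word lengths and the values are sets of the words of that length. Then build a list of the lengths and iterate over that, rather than over the input word (s), for example: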

words = ['toothpaste',
         'hard-to-find',
         'economic',
         'point',
         'food',
         'seal',
         'outrageous',
         'motionless',
         'ice',
         'tow',
         'boot',
         'cruel',
         'peanut',
         'butter']

index = {}
for word in words:
    index.setdefault(len(word), set()).add(word)

lengths = sorted(index)

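def break_into_spaces(s):
    s_length = len(s)
    for length in lengths:
        if length >= s_length:
            break  # lengths is sorted, so no longer prefix can fit
        left = s[:length]   # prefix of the candidate length
        right = s[length:]  # remaining suffix
        # The prefix must be a known word of this length, and the suffix
        # a known word of the complementary length.
        if left in index[length] and right in index.get(s_length - length, ()):
            print("{} {}".format(left, right))
            print("Found them!")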


break_into_spaces('peanutbutter')

Output

peanut butter
Found them!

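This does save time in two ways:

  1. It avoids iterating over the whole input word. Imagine the case where the input word is shorter than every word in the dictionary: the loop breaks immediately and nothing is printed.
  2. By storing the words in sets of equal length, you only need to check for a match among words of the same length rather than against all words. Note that this may be moot: Python dictionaries and sets are hash tables, so in theory a membership check is O(1) on average anyway.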
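
To illustrate the second point, here is a minimal sketch (not part of the original answer) that keeps all words in one plain set and uses only the shortest and longest word lengths to bound the split positions. The name `split_in_two` and the sample word list are assumptions made for illustration.

```python
def split_in_two(s, words):
    """Return every (prefix, suffix) split of s whose two parts are both in words."""
    words = {w for w in words if w}  # ignore empty strings
    if not words:
        return []
    shortest = min(len(w) for w in words)
    longest = max(len(w) for w in words)
    # Both parts need a length between shortest and longest, which bounds
    # the split position i (the prefix length) from both sides.
    start = max(shortest, len(s) - longest)
    stop = min(longest, len(s) - shortest)
    return [(s[:i], s[i:]) for i in range(start, stop + 1)
            if s[:i] in words and s[i:] in words]


print(split_in_two("peanutbutter", ["peanut", "butter", "pea", "nut"]))
# prints [('peanut', 'butter')]
```

Whether this is faster than the per-length index probably depends on the data; both approaches do one hash lookup per half.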

Answer 1: (score: 1)

My option:

wordlist = ['air', 'pack', 'port', 'hard', 'back', 'bag', 'disk', 'ground', 'play']
word = 'playground'

lenw, minlen = len(word), min([len(w) for w in wordlist])
pairs = [(word[:n], word[n:]) for n in range(1,lenw) if (n >= minlen and n < lenw-minlen+1) ]
found = False
for w1, w2 in pairs:
  if w1 in wordlist and w2 in wordlist:
    print('Found ' + word + ' as: ' + w1 + ' + ' + w2)
    found = True
    break
if not found: print('No words found')

#=> Found playground as: play + ground

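pairs is the list of two-way splits of the word in which neither sub-word is shorter than the shortest word in the word list. This reduces the number of lookups.

Print it to see: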

print(pairs)
#=> [('pla', 'yground'), ('play', 'ground'), ('playg', 'round'), ('playgr', 'ound'), ('playgro', 'und')]
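
The same candidate splits can be produced without filtering, by starting the range at the shortest word length. A small sketch, assuming the word list contains no empty strings; `candidate_splits` is a hypothetical helper name, and the word list is converted to a set so that each `in` test is a hash lookup rather than a list scan.

```python
def candidate_splits(word, wordlist):
    """Yield (prefix, suffix) pairs where neither part is shorter than the shortest word."""
    minlen = min(len(w) for w in wordlist)
    for n in range(minlen, len(word) - minlen + 1):
        yield word[:n], word[n:]


wordlist = ['air', 'pack', 'port', 'hard', 'back', 'bag', 'disk', 'ground', 'play']
words = set(wordlist)  # set membership avoids scanning the whole list

pairs = list(candidate_splits('playground', wordlist))
matches = [(w1, w2) for w1, w2 in pairs if w1 in words and w2 in words]
print(matches)
# prints [('play', 'ground')]
```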


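For a large word list, I suggest grouping the words by starting letter (as a vocabulary), then looking up only the words whose starting letter is in the intersection between the letters of the word and the set of starting letters. Incomplete code here: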

letters = set(word)
print(letters) #=> {'r', 'a', 'u', 'g', 'l', 'n', 'd', 'o', 'y', 'p'}

alphabet = {}
for w in wordlist:  # `w`, so the input `word` is not overwritten
    alphabet.setdefault(w[0], set()).add(w)
print(alphabet)
#=> {'a': {'air'}, 'p': {'port', 'play', 'pack'}, 'h': {'hard'}, 'b': {'back', 'bag'}, 'd': {'disk'}, 'g': {'ground'}}

So the intersection is {'g', 'p', 'd', 'a'}. Then build the lookup list:

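intersection = letters & set(alphabet)  # starting letters that also occur in the word
lookuplist = []
for letter in intersection:
  for w in alphabet[letter]:
    lookuplist.append(w)
lookuplist #=> ['air', 'disk', 'ground', 'port', 'pack', 'play'] (order may vary, sets are unordered)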

So use lookuplist instead of wordlist.
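
An equivalent lookup list can also be built in a single pass over the word list. This is only a sketch of an alternative to the grouping above; the name `filtered` is made up here, and unlike the set-based version it keeps the original order of the word list.

```python
wordlist = ['air', 'pack', 'port', 'hard', 'back', 'bag', 'disk', 'ground', 'play']
word = 'playground'

letters = set(word)
# Keep a word only if its first letter occurs somewhere in the input word.
filtered = [w for w in wordlist if w[0] in letters]
print(filtered)
# prints ['air', 'pack', 'port', 'disk', 'ground', 'play']
```

The grouped vocabulary is still worth building when many input words are checked against the same word list, since it is computed only once.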


Using a few functions to put things in order:

def vocabulary(wordlist):
  # Group the words by their first letter.
  res = {}
  for word in wordlist:
    res.setdefault(word[0], set()).add(word)
  return res

def lookuplist(vocabulary, word):
  # Keep only the words whose first letter occurs somewhere in `word`.
  vocabulary_alphabet = set(vocabulary.keys())
  word_letters = set(word)
  intersection = vocabulary_alphabet.intersection(word_letters)
  result = []
  for letter in intersection:
    for w in vocabulary[letter]:  # `w`, so the `word` argument is not shadowed
      result.append(w)
  return result

def find_word(word, lookuplist):
  if not lookuplist: return []  # min() would raise on an empty list
  candidates = set(lookuplist)  # hash lookups instead of list scans
  lenw, minlen = len(word), min([len(w) for w in lookuplist])
  pairs = [(word[:n], word[n:]) for n in range(1,lenw) if (n >= minlen and n < lenw-minlen+1) ]
  for w1, w2 in pairs:
    if w1 in candidates and w2 in candidates: return (word, w1, w2)
  return []

You can use them like this:

wordlist = ['air', 'pack', 'port', 'hard', 'back', 'bag', 'disk', 'ground', 'play']
word = 'playground'

# Distinct variable names, so the functions are not overwritten and can be called again.
vocab = vocabulary(wordlist) # run once then store the result
candidates = lookuplist(vocab, word)
found_word = find_word(word, candidates)
print(found_word)
#=> ('playground', 'play', 'ground')
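
As a variation on the above (a sketch, not the answer's own method), the first-letter buckets can be queried directly: a prefix of the word can only be in the bucket for `word[0]`, and a suffix starting at position n only in the bucket for `word[n]`. The names `build_vocabulary` and `split_with_vocabulary` are hypothetical, and the word list is assumed to contain no empty strings.

```python
from collections import defaultdict


def build_vocabulary(wordlist):
    """Group words by first letter; meant to be built once and reused."""
    vocab = defaultdict(set)
    for w in wordlist:
        vocab[w[0]].add(w)
    return vocab


def split_with_vocabulary(word, vocab):
    """Return the first (prefix, suffix) split found in vocab, or None."""
    for n in range(1, len(word)):
        prefix, suffix = word[:n], word[n:]
        # .get() avoids creating empty buckets in the defaultdict.
        if prefix in vocab.get(word[0], ()) and suffix in vocab.get(word[n], ()):
            return prefix, suffix
    return None


wordlist = ['air', 'pack', 'port', 'hard', 'back', 'bag', 'disk', 'ground', 'play']
vocab = build_vocabulary(wordlist)  # built once
for candidate in ['playground', 'backpack', 'airport', 'xyz']:
    print(candidate, split_with_vocabulary(candidate, vocab))
```

Returning None for a failed search, rather than an empty list, is a design choice of this sketch.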