How to check whether a string is in a set

Time: 2019-02-23 17:10:35

Tags: python

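Edit: The reason I was getting strange results is that the dictionary I was using (https://github.com/dwyl/english-words/blob/master/words_alpha.txt) contains many entries that are not words. All of my code below works correctly. I thought the problem was the if word in words line, but I was wrong.

Here is my code: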

cipher = (input('what is your cipher? '))
alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
shift = 0
score=0
answer=''
scores=[]
answers=[]
with open('smalldic.txt') as word_file:
    words2 = set(word_file.read().lower().split())
with open('bigdic.txt') as word_file:
    words = set(word_file.read().split()) 
while shift<26:                           
    shift+=1
    for letter in cipher:                 
        try:
            answer+=alphabet[(alphabet.index(letter)+shift)%26]
        except ValueError:
            answer+=letter
    answer = answer.split()
    for word in answer:
        if word in words:
            score+=len(word)*13
            if word in words2:
                score+=len(word)*26           
    scores.append(score)
    answers.append(answer)
    answer=''
    score=0
maxscore=max(scores)
count=-1
for i in scores:
    count+=1
    if i==maxscore:
        print(i)
        print(answers[count])
pause=input('Press any key to finish')

Shell

Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> 
= RESTART: C:\Program Files (x86)\Python37-32\Scripts\caesarcipherdecoder.py =
what is your cipher? this is the result
['uijt', 'jt', 'uif', 'sftvmu']
['uijt', 'jt', 'uif', 'sftvmu']
['vjku', 'ku', 'vjg', 'tguwnv']
['vjku', 'ku', 'vjg', 'tguwnv']
['wklv', 'lv', 'wkh', 'uhvxow']
['wklv', 'lv', 'wkh', 'uhvxow']
['xlmw', 'mw', 'xli', 'viwypx']
['xlmw', 'mw', 'xli', 'viwypx']
['ymnx', 'nx', 'ymj', 'wjxzqy']
['ymnx', 'nx', 'ymj', 'wjxzqy']
['znoy', 'oy', 'znk', 'xkyarz']
['znoy', 'oy', 'znk', 'xkyarz']
['aopz', 'pz', 'aol', 'ylzbsa']
['aopz', 'pz', 'aol', 'ylzbsa']
['bpqa', 'qa', 'bpm', 'zmactb']
['bpqa', 'qa', 'bpm', 'zmactb']
['cqrb', 'rb', 'cqn', 'anbduc']
['cqrb', 'rb', 'cqn', 'anbduc']
['drsc', 'sc', 'dro', 'bocevd']
['drsc', 'sc', 'dro', 'bocevd']
['estd', 'td', 'esp', 'cpdfwe']
['estd', 'td', 'esp', 'cpdfwe']
['ftue', 'ue', 'ftq', 'dqegxf']
['ftue', 'ue', 'ftq', 'dqegxf']
['guvf', 'vf', 'gur', 'erfhyg']
['guvf', 'vf', 'gur', 'erfhyg']
['hvwg', 'wg', 'hvs', 'fsgizh']
['hvwg', 'wg', 'hvs', 'fsgizh']
['iwxh', 'xh', 'iwt', 'gthjai']
['iwxh', 'xh', 'iwt', 'gthjai']
['jxyi', 'yi', 'jxu', 'huikbj']
['jxyi', 'yi', 'jxu', 'huikbj']
['kyzj', 'zj', 'kyv', 'ivjlck']
['kyzj', 'zj', 'kyv', 'ivjlck']
['lzak', 'ak', 'lzw', 'jwkmdl']
['lzak', 'ak', 'lzw', 'jwkmdl']
['mabl', 'bl', 'max', 'kxlnem']
['mabl', 'bl', 'max', 'kxlnem']
['nbcm', 'cm', 'nby', 'lymofn']
['nbcm', 'cm', 'nby', 'lymofn']
['ocdn', 'dn', 'ocz', 'mznpgo']
['ocdn', 'dn', 'ocz', 'mznpgo']
['pdeo', 'eo', 'pda', 'naoqhp']
['pdeo', 'eo', 'pda', 'naoqhp']
['qefp', 'fp', 'qeb', 'obpriq']
['qefp', 'fp', 'qeb', 'obpriq']
['rfgq', 'gq', 'rfc', 'pcqsjr']
['rfgq', 'gq', 'rfc', 'pcqsjr']
['sghr', 'hr', 'sgd', 'qdrtks']
['sghr', 'hr', 'sgd', 'qdrtks']
['this', 'is', 'the', 'result']
['this', 'is', 'the', 'result']
[1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 2, 0, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 2, 4]

2 answers:

Answer 0 (score: 0)

Your code works fine for me. Are you sure the problem isn't in the counter? The code below prints 2 for me, as it should:

answer = ['j', 'mpwf', 'taub', 'tubdl', 'tuba', 'pwfsgmpx', 'apple']
words = {'jam', 'jelly', 'tuba', 'apple'}
score = 0
for word in answer:
    # Set membership is an exact match, so only 'tuba' and 'apple' count.
    if word in words:
        score += 1
print(score)

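Answer 1 (score: 0)

As roganjosh observed, the behavior you describe does not happen.

You offered a two-letter input word, 'it'. My dictionary lists 160 "valid" two-letter combinations, almost a quarter of the 676 possible combinations. I don't know what input dictionary you used, but this effect could well produce a large number of scores of 1. For example, I notice that 'mw' could stand for megawatt, and I also see some ISO 3166 two-letter country codes in your output. The dictionary I used is /usr/share/dict/words, which ships with OS X.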
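
To check how permissive a given word list is, the two-letter entries can be counted directly. This is a minimal sketch, not the answerer's code: the helper name two_letter_fraction and the sample set are made up for illustration, and in practice you would pass in the set loaded from your own dictionary file.

```python
from string import ascii_lowercase


def two_letter_fraction(words):
    """Return (count, fraction) of the 26**2 two-letter combinations found in words."""
    letters = set(ascii_lowercase)
    # Keep only entries that are exactly two lowercase ASCII letters.
    hits = {w for w in words if len(w) == 2 and set(w) <= letters}
    return len(hits), len(hits) / 26 ** 2


# Hypothetical sample; replace with the set built from your dictionary file.
sample = {'it', 'is', 'mw', 'id', 'the', 'result', 'a'}
count, fraction = two_letter_fraction(sample)
print(count, fraction)
```

A fraction close to a quarter or more suggests that random two-letter shifts will often be scored as words.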

To debug, just add a print statement after incrementing the score:

    for word in answer:
        if word in words:
            score += 1
            print(word)

This will highlight the "surprising" word values.

Python's in operator behaves exactly as documented.
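
As a quick illustration of that documented behavior, the sketch below shows that in on a set tests for an exact, case-sensitive element match, while in on a str tests for a substring. The sample set is invented for the example.

```python
# Hypothetical word set, standing in for the one loaded from a dictionary file.
words = {'this', 'is', 'the', 'result'}

exact = 'is' in words                # exact element match
wrong_case = 'Is' in words           # case matters
partial = 'resul' in words           # a set does no substring matching
substring = 'resul' in 'result'      # a str does

print(exact, wrong_case, partial, substring)
```

One consequence for the question's code: bigdic.txt is read without .lower(), so any capitalised entries in that file would never match the lowercase candidates.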

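Edit:

Word lists are often compiled to support spell-checking applications, which tend to err on the side of including everything. A moment's searching turns up plenty of word lists, but the first one I grabbed, from infochimps, has 427 two-letter words, an impressive 63% of the possible combinations. Perhaps SCOWL would prove relevant.

You may want to use the accompanying (platform-independent) code to access a reasonably sensible corpus of English words.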

#! /usr/bin/env python

# You will need: pip install pyenchant
import enchant


def letters():
    return range(ord('a'), ord('z') + 1)


def get_2_letter_words():
    for a in letters():
        for b in letters():
            yield chr(a) + chr(b)


def num_valid_2_letter_words():
    d = enchant.Dict("en_US")
    return sum(d.check(word) for word in get_2_letter_words())


if __name__ == '__main__':
    n = num_valid_2_letter_words()
    print(n, n / 26 ** 2)

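What you really want here is unigram frequencies. That is, rather than having a two-letter word win or lose based on some boolean check() function, you would rather give a common word like 'it' a high score and a less common word like 'id' or 'mw' a lower score.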
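
A frequency-based score might be sketched as follows. The frequency values, the FLOOR constant, and the function name unigram_score are all assumptions invented for illustration; real counts would come from a corpus.

```python
import math

# Hypothetical relative frequencies, invented for illustration only.
UNIGRAM_FREQ = {
    'the': 0.05, 'is': 0.01, 'it': 0.01,
    'this': 0.005, 'result': 0.0002, 'id': 0.00001,
}
FLOOR = 1e-9  # assumed probability for words missing from the table


def unigram_score(candidate_words, freq=UNIGRAM_FREQ, floor=FLOOR):
    """Sum of log-probabilities; a higher (less negative) total is more plausible."""
    return sum(math.log(freq.get(w, floor)) for w in candidate_words)


plain = ['this', 'is', 'the', 'result']
shifted = ['uijt', 'jt', 'uif', 'sftvmu']
print(unigram_score(plain), unigram_score(shifted))
```

Summing logarithms rather than multiplying raw probabilities avoids underflow on long inputs, and the floor keeps an unknown word from sending the score to negative infinity.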

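I like Dunes's suggestion of paying more attention to longer words. Suppose we lack unigram frequency counts and are therefore forced to assume a uniform prior over n-letter words, e.g. 'it' and 'id' are equally likely to appear in the plaintext. Count the number of n-letter words in the dictionary, divide by 26 ** n, and use that fraction in the score.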
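
One way to turn that fraction into a score is sketched below. The answer only says to use the fraction; taking its negative logarithm as a per-length weight is an assumption of this sketch, as are the names length_weights and score and the tiny sample dictionary.

```python
import math
from collections import Counter


def length_weights(words):
    """Map word length n to -log(count_n / 26**n), so rarer hits weigh more."""
    counts = Counter(len(w) for w in words if w.isalpha())
    return {n: -math.log(c / 26 ** n) for n, c in counts.items()}


def score(candidate_words, words, weights):
    """Add the weight for each candidate that is in the dictionary."""
    return sum(weights.get(len(w), 0.0) for w in candidate_words if w in words)


# Hypothetical tiny dictionary; a real one would be loaded from a file.
words = {'it', 'id', 'is', 'mw', 'the', 'this', 'result'}
weights = length_weights(words)
print(score(['this', 'is', 'the', 'result'], words, weights))
print(score(['mw', 'id'], words, weights))
```

With a dense set of two-letter entries, a two-letter hit earns little, while a six-letter hit is strong evidence that the shift is correct.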

Combining Levenshtein distance with .suggest() would improve robustness to misspellings in the plaintext.
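
A minimal sketch of the distance side of that idea, in pure Python. The helper names levenshtein and closest are made up here, and the candidate list merely stands in for whatever a spell checker's suggest() call would return; no enchant call is made.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def closest(word, candidates, max_distance=1):
    """Return the nearest candidate within max_distance, or None."""
    best = min(candidates, key=lambda c: levenshtein(word, c), default=None)
    if best is not None and levenshtein(word, best) <= max_distance:
        return best
    return None


# Hypothetical suggestions for a misspelled plaintext word.
print(closest('reslt', ['result', 'resume', 'insult']))
```

A decoded word that is within one edit of a dictionary word could then earn partial credit instead of scoring zero.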