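Edit:
The reason I was getting strange results is that the dictionary I am using (https://github.com/dwyl/english-words/blob/master/words_alpha.txt) contains many entries that are not words. All of my code below works correctly. I thought the problem was the if word in words line, but I was wrong.
Here is my code: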
cipher = (input('what is your cipher? '))
alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
shift = 0
score=0
answer=''
scores=[]
answers=[]
with open('smalldic.txt') as word_file:
    words2 = set(word_file.read().lower().split())
with open('bigdic.txt') as word_file:
    words = set(word_file.read().split())
while shift<26:
    shift+=1
    for letter in cipher:
        try:
            answer+=alphabet[(alphabet.index(letter)+shift)%26]
        except ValueError:
            answer+=letter
    answer = answer.split()
    for word in answer:
        if word in words:
            score+=len(word)*13
        if word in words2:
            score+=len(word)*26
    scores.append(score)
    answers.append(answer)
    answer=''
    score=0
maxscore=max(scores)
count=-1
for i in scores:
    count+=1
    if i==maxscore:
        print(i)
        print(answers[count])
pause=input('Press any key to finish')
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>>
= RESTART: C:\Program Files (x86)\Python37-32\Scripts\caesarcipherdecoder.py =
what is your cipher? this is the result
['uijt', 'jt', 'uif', 'sftvmu']
['uijt', 'jt', 'uif', 'sftvmu']
['vjku', 'ku', 'vjg', 'tguwnv']
['vjku', 'ku', 'vjg', 'tguwnv']
['wklv', 'lv', 'wkh', 'uhvxow']
['wklv', 'lv', 'wkh', 'uhvxow']
['xlmw', 'mw', 'xli', 'viwypx']
['xlmw', 'mw', 'xli', 'viwypx']
['ymnx', 'nx', 'ymj', 'wjxzqy']
['ymnx', 'nx', 'ymj', 'wjxzqy']
['znoy', 'oy', 'znk', 'xkyarz']
['znoy', 'oy', 'znk', 'xkyarz']
['aopz', 'pz', 'aol', 'ylzbsa']
['aopz', 'pz', 'aol', 'ylzbsa']
['bpqa', 'qa', 'bpm', 'zmactb']
['bpqa', 'qa', 'bpm', 'zmactb']
['cqrb', 'rb', 'cqn', 'anbduc']
['cqrb', 'rb', 'cqn', 'anbduc']
['drsc', 'sc', 'dro', 'bocevd']
['drsc', 'sc', 'dro', 'bocevd']
['estd', 'td', 'esp', 'cpdfwe']
['estd', 'td', 'esp', 'cpdfwe']
['ftue', 'ue', 'ftq', 'dqegxf']
['ftue', 'ue', 'ftq', 'dqegxf']
['guvf', 'vf', 'gur', 'erfhyg']
['guvf', 'vf', 'gur', 'erfhyg']
['hvwg', 'wg', 'hvs', 'fsgizh']
['hvwg', 'wg', 'hvs', 'fsgizh']
['iwxh', 'xh', 'iwt', 'gthjai']
['iwxh', 'xh', 'iwt', 'gthjai']
['jxyi', 'yi', 'jxu', 'huikbj']
['jxyi', 'yi', 'jxu', 'huikbj']
['kyzj', 'zj', 'kyv', 'ivjlck']
['kyzj', 'zj', 'kyv', 'ivjlck']
['lzak', 'ak', 'lzw', 'jwkmdl']
['lzak', 'ak', 'lzw', 'jwkmdl']
['mabl', 'bl', 'max', 'kxlnem']
['mabl', 'bl', 'max', 'kxlnem']
['nbcm', 'cm', 'nby', 'lymofn']
['nbcm', 'cm', 'nby', 'lymofn']
['ocdn', 'dn', 'ocz', 'mznpgo']
['ocdn', 'dn', 'ocz', 'mznpgo']
['pdeo', 'eo', 'pda', 'naoqhp']
['pdeo', 'eo', 'pda', 'naoqhp']
['qefp', 'fp', 'qeb', 'obpriq']
['qefp', 'fp', 'qeb', 'obpriq']
['rfgq', 'gq', 'rfc', 'pcqsjr']
['rfgq', 'gq', 'rfc', 'pcqsjr']
['sghr', 'hr', 'sgd', 'qdrtks']
['sghr', 'hr', 'sgd', 'qdrtks']
['this', 'is', 'the', 'result']
['this', 'is', 'the', 'result']
[1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 2, 0, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 2, 4]
Answer 0 (score: 0)
Your code works fine for me. Are you sure the problem isn't in the counter? The code below prints 2 for me, as it should:
answer = ['j', 'mpwf', 'taub', 'tubdl', 'tuba', 'pwfsgmpx', 'apple']
words = {'jam', 'jelly', 'tuba', 'apple'}
score = 0
for word in answer:
    if word in words:
        score += 1
print(score)
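Answer 1 (score: 0)
As roganjosh observed, the behavior you describe does not occur.
You supplied the two-letter input word 'it'. My dictionary lists 160 "valid" two-letter combinations, nearly a quarter of the 676 possible combinations. I don't know which input dictionary you used, but this effect could account for a large number of scores of 1. For example, I notice that "mw" could stand for megawatt, and I also see some ISO-3166 two-letter country codes in your output. The dictionary I used is /usr/share/dict/words, which ships with OS X.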
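One way to check how permissive a given word list is would be to count its two-letter entries and compare the total with the 676 possible pairs. The following is only a rough sketch: the function name two_letter_density and the sample list are made up for illustration, and in practice you would pass in the contents of your own dictionary file.

```python
import string

def two_letter_density(words):
    """Return (count, fraction) of distinct two-letter a-z entries in words."""
    letters = set(string.ascii_lowercase)
    pairs = {w.lower() for w in words
             if len(w) == 2 and set(w.lower()) <= letters}
    return len(pairs), len(pairs) / 26 ** 2

# Hypothetical sample; replace with e.g. open('bigdic.txt').read().split()
sample = ['it', 'is', 'mw', 'IT', 'the', 'a', "o'"]
count, fraction = two_letter_density(sample)
print(count, round(fraction, 4))
```

A high fraction suggests the list accepts abbreviations and codes as well as ordinary words, so short dictionary hits carry little evidence.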
To debug, just add a print statement after incrementing the score:
for word in answer:
    if word in words:
        score += 1
        print(word)
This will highlight any "surprising" word values.
Python's 'in' operator behaves exactly as documented.
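For reference, here is a small illustration of what 'in' does with different container types, since a membership test on a set and a substring test on a string are easy to confuse. The sample values are made up for the example.

```python
words = {'jam', 'jelly', 'tuba', 'apple'}

# On a set, 'in' is an exact-match membership test.
print('tuba' in words)    # True
print('tub' in words)     # False: no partial matching

# On a string, 'in' is a substring test, which is far more permissive.
text = 'jam jelly tuba apple'
print('tub' in text)      # True

# Set membership is case-sensitive, so normalise case before testing.
print('Tuba' in words)          # False
print('Tuba'.lower() in words)  # True
```

The last two lines matter when a word list is loaded without lowercasing, as with bigdic.txt in the question.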
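Edit:
People usually compile word lists to support spell-checking applications, which tend to err on the side of including everything. A moment's searching turns up plenty of word lists, but the first one I grabbed, from infochimps, has 427 two-letter words, an impressive 63% of the possible combinations. Perhaps SCOWL will prove relevant.
You may want to use the accompanying (platform-independent) code to access a reasonably sensible corpus of English words.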
#! /usr/bin/env python
# You will need: pip install pyenchant
import enchant

def letters():
    return range(ord('a'), ord('z') + 1)

def get_2_letter_words():
    for a in letters():
        for b in letters():
            yield chr(a) + chr(b)

def num_valid_2_letter_words():
    d = enchant.Dict("en_US")
    return sum(d.check(word) for word in get_2_letter_words())

if __name__ == '__main__':
    n = num_valid_2_letter_words()
    print(n, n / 26 ** 2)
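What you really want here is unigram frequencies. That is, rather than letting a two-letter word win or lose based on some boolean check() function, you would give a common word like 'it' a high score and a less common word such as 'id' or 'mw' a lower one.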
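A minimal sketch of frequency-based scoring follows. The counts in unigram_counts are placeholders invented for illustration, not real corpus statistics; a real table would have to come from a corpus of your choice. Each known word contributes the log of its relative frequency, and unseen words get a fixed penalty (the floor value is an arbitrary assumption).

```python
import math

# Placeholder counts, NOT real corpus data; substitute a real unigram table.
unigram_counts = {'the': 500, 'is': 200, 'it': 150, 'this': 100,
                  'result': 10, 'id': 2, 'mw': 1}
total = sum(unigram_counts.values())

def unigram_score(candidate_words, floor=1e-9):
    """Sum of log relative frequencies; unseen words score log(floor)."""
    score = 0.0
    for w in candidate_words:
        p = unigram_counts.get(w, 0) / total
        score += math.log(p if p > 0 else floor)
    return score

plain = ['this', 'is', 'the', 'result']
shifted = ['uijt', 'jt', 'uif', 'sftvmu']
print(unigram_score(plain) > unigram_score(shifted))
```

With scores like these, the candidate decryption with the highest (least negative) total would be chosen, and a lone hit on something like 'mw' no longer counts as much as a hit on 'it'.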
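I like Dunes's suggestion even better: pay more attention to longer words. Suppose we lack unigram frequency counts and are therefore forced to assume a uniform prior over n-letter words, for example that 'it' and 'id' are equally likely to appear in the plaintext. Count the number of n-letter words in the dictionary, divide by 26 ** n, and use that fraction in the score.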
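The answer does not say exactly how the fraction should enter the score, so the sketch below is just one possible reading: each dictionary hit is rewarded with the negative log of the density for its length, so a hit on a long word (low density) counts far more than a hit on a two-letter word. The toy dictionary and the function names are hypothetical, and the dictionary is assumed to hold distinct lowercase a-z words.

```python
import math
from collections import Counter

def length_densities(dictionary):
    """Map word length n to (number of n-letter entries) / 26**n."""
    counts = Counter(len(w) for w in dictionary)
    return {n: c / 26 ** n for n, c in counts.items()}

def density_score(candidate_words, dictionary, densities):
    """Reward each dictionary hit by -log(density) for its length."""
    return sum(-math.log(densities[len(w)])
               for w in candidate_words if w in dictionary)

# Hypothetical toy dictionary; use your real word set instead.
dictionary = {'it', 'is', 'id', 'mw', 'the', 'this', 'result'}
densities = length_densities(dictionary)
print(density_score(['this', 'is', 'the', 'result'], dictionary, densities))
print(density_score(['xlmw', 'mw', 'xli', 'viwypx'], dictionary, densities))
```

Here the spurious hit on 'mw' earns only a small reward, while the four hits in the correct plaintext add up to a much larger one.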
Combining Levenshtein distance with .suggest() would make this more robust to misspellings in the plaintext.
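For illustration, here is a plain-Python sketch of Levenshtein distance using the usual dynamic-programming recurrence. The helper closest() and the candidate list are hypothetical; in practice the candidates might be the list returned by the .suggest() method of a pyenchant Dict, which is not called here so that the example runs without pyenchant installed.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def closest(word, candidates):
    """Return the candidate with the smallest edit distance to word."""
    return min(candidates, key=lambda c: levenshtein(word, c))

# Hypothetical candidates, standing in for a spell checker's suggestions.
print(levenshtein('kitten', 'sitting'))
print(closest('resullt', ['result', 'insult', 'reset']))
```

A decoded word within a small distance of a dictionary word could then be given partial credit instead of none.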