Python: Autocorrect

Time: 2014-04-25 11:50:36

Tags: python python-2.7

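I have two files, check.txt and orig.txt. I want to check every word in check.txt and see whether it matches any word in orig.txt. If it matches, the code should replace that word with its first match; otherwise it should leave the word as it is. But somehow it is not working as required. Please help.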

check.txt looks like this:

ukrain

troop

force

and orig.txt looks like this:

ukraine cnn should stop pretending & announce: we will not report news while it reflects bad on obama @bostonglobe @crowleycnn @hardball

rt @cbcnews: breaking: .@vice journalist @simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou

russia 'outraged' at deadly shootout in east #ukraine -  moscow:... http://t.co/nqim7uk7zg
 #groundtroops #russianpresidentvladimirputin

http://pastebin.com/XJeDhY3G

f = open('check.txt','r')
orig = open('orig.txt','r')
new = open('newfile.txt','w')

for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()            
            if word in word2:
                word = word2
            else:
                print('not found')
        new.write(word)

1 Answer:

Answer 0 (score: 1)

Your code has two problems:

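  1. When you loop over the words in f, each word still has its trailing newline character, so your in check does not work.
  2. You want to iterate over orig for each word in f, but files are iterators, so orig is exhausted after the first word from f.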
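
Both problems can be reproduced without touching the disk. The sketch below uses io.StringIO objects as stand-ins for the two files and a shortened version of the sample data; these are assumptions for illustration, and the code is written for Python 3, although real file objects in Python 2.7 behave the same way on both points.

```python
import io

# In-memory stand-ins for check.txt and orig.txt (illustrative data only).
check = io.StringIO("ukrain\ntroop\n")
orig = io.StringIO("ukraine cnn should stop\n#groundtroops\n")

# Problem 1: iterating over a file yields lines that keep their trailing newline.
first_word = next(check)
has_newline = first_word.endswith("\n")
raw_match = first_word in "ukraine"               # "ukrain\n" is not a substring
stripped_match = first_word.strip() in "ukraine"  # matches once the newline is gone

# Problem 2: a file object is an iterator and can be consumed only once.
first_pass = [line for line in orig]
second_pass = [line for line in orig]  # nothing left to read

print(has_newline, raw_match, stripped_match)
print(len(first_pass), len(second_pass))
```

The second list comes back empty because the first comprehension already consumed the iterator, which is why the inner loop in the question only does any work for the first word.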

    You can fix these problems by doing word = word.strip() and orig = list(orig), or you can try something like this:

    # get all stemmed words
    stemmed = [line.strip() for line in f]
    # set of lowercased original words
    original = set(word.lower() for line in orig for word in line.split())
    # map stemmed words to unstemmed words
    unstemmed = {word: None for word in stemmed}
    # find original words for word stems in map
    for stem in unstemmed:
        for word in original:
            if stem in word:
                unstemmed[stem] = word
    print(unstemmed)
    

    Or shorter (without the final double loop), using difflib, as suggested in the comments:

    import difflib
    unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
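
Note that difflib.get_close_matches returns a list (possibly empty) rather than a single string, so the dictionary above maps each stem to a list. A minimal sketch of unwrapping that result, falling back to the stem itself when nothing is close enough, might look like this. The helper name best_match and the short word list are made up for illustration.

```python
import difflib

def best_match(word, candidates):
    """Return the closest candidate, or the word itself if none is close enough."""
    matches = difflib.get_close_matches(word, candidates, n=1)
    return matches[0] if matches else word

# Sample vocabulary taken from the orig.txt excerpt in the question.
original = ["ukraine", "cnn", "should", "stop", "#groundtroops"]
corrected = {stem: best_match(stem, original) for stem in ["ukrain", "force"]}
print(corrected)
```

Keep in mind that get_close_matches scores similarity over the whole string (the default cutoff is 0.6), which is not the same test as stem in word. A short stem buried in a long token such as #groundtroops may fall below the cutoff and be left unmatched.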
    

    Also, remember to close your files, or use the with keyword to close them automatically.
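
Putting the pieces together, here is one possible sketch that uses with blocks and keeps the first match, as the question asks. The function name autocorrect and the throwaway demo files are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile

def autocorrect(check_path, orig_path, out_path):
    """Replace each word in check_path with the first word in orig_path containing it."""
    with open(orig_path) as orig:
        # Read once into a list so it can be scanned again for every stem.
        original = [w.lower() for line in orig for w in line.split()]
    with open(check_path) as check, open(out_path, "w") as out:
        for line in check:
            stem = line.strip()
            if not stem:
                continue  # skip blank lines
            # First word containing the stem, or the stem itself if none does.
            match = next((w for w in original if stem in w), stem)
            out.write(match + "\n")

# Demo on temporary files with data shortened from the question.
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, n) for n in ("check.txt", "orig.txt", "newfile.txt")]
with open(paths[0], "w") as f:
    f.write("ukrain\ntroop\nforce\n")
with open(paths[1], "w") as f:
    f.write("ukraine cnn should stop\n#groundtroops now free\n")
autocorrect(*paths)
with open(paths[2]) as f:
    result = f.read().split()
print(result)
```

Every file is closed as soon as its with block ends, even if an exception is raised inside it, so no explicit close() calls are needed.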