Question

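I have two files, check.txt and orig.txt. I want to check each word in check.txt and see whether it matches any word in orig.txt. If it matches, the code should replace that word with the first match; otherwise it should leave the word as it is. But somehow it does not work as required. Please help.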

check.txt looks like this:

ukrain

troop

force

and orig.txt looks like:

ukraine cnn should stop pretending &amp; announce: we will not report news while it reflects bad on obama @bostonglobe @crowleycnn @hardball

rt @cbcnews: breaking: .@vice journalist @simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou

russia 'outraged' at deadly shootout in east #ukraine -  moscow:... http://t.co/nqim7uk7zg
 #groundtroops #russianpresidentvladimirputin

http://pastebin.com/XJeDhY3G

f = open('check.txt','r')
orig = open('orig.txt','r')
new = open('newfile.txt','w')

for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()            
            if word in word2:
                word = word2
            else:
                print('not found')
        new.write(word)

Answer 1

There are two problems with your code:

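When you loop over the words in f, each word still has a trailing newline character, so your in check does not work
You want to iterate over orig for each word in f, but files are iterators, so orig is exhausted after the first word in f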
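Both problems can be reproduced without touching the disk. The following is a minimal sketch that uses io.StringIO as a stand-in for the open orig.txt file (an assumption made for illustration; a real file object iterates the same way):

```python
import io

# Stand-in for open('orig.txt'); iterating it behaves like iterating a file.
orig = io.StringIO("ukraine cnn\n#groundtroops\n")

first_pass = [line for line in orig]   # consumes every line
second_pass = [line for line in orig]  # the iterator is already exhausted

# A line read from check.txt keeps its newline, so the substring test fails.
word = "ukrain\n"
found_raw = word in "ukraine"
found_stripped = word.strip() in "ukraine"
```

Here second_pass ends up empty and found_raw is False, which is why the inner loop in the question only ever runs for the first word and never reports a match.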

You can fix these problems by doing word = word.strip() and orig = list(orig), or you could try something like this:

# f and orig are the open file objects from the question
# get all stemmed words
stemmed = [line.strip() for line in f]
# set of lowercased original words
original = set(word.lower() for line in orig for word in line.split())
# map stemmed words to unstemmed words
unstemmed = {word: None for word in stemmed}
# find original words for word stems in map
for stem in unstemmed:
    for word in original:
        if stem in word:
            unstemmed[stem] = word
print(unstemmed)

Or shorter (without the final double loop), using difflib, as suggested in the comments:

import difflib
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
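Note that difflib.get_close_matches returns a list, which is empty when nothing scores above the cutoff (0.6 by default). A small wrapper can fall back to the original word in that case. This is only a sketch: the helper name closest_or_self and the short candidate list are assumptions made for illustration.

```python
import difflib

def closest_or_self(word, candidates, cutoff=0.6):
    """Return the closest candidate, or word itself if none is close enough."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Hypothetical sample: a few lowercased tokens taken from the question's orig.txt.
original = ["ukraine", "cnn", "#groundtroops", "moscow"]
stemmed = ["ukrain", "troop", "force"]

unstemmed = {word: closest_or_self(word, original) for word in stemmed}
```

With the default cutoff, "ukrain" maps to "ukraine", but "troop" is left unchanged: it is a substring of "#groundtroops", yet the similarity ratio of the two strings is below 0.6. If substring matches matter more than overall similarity, the explicit loop with in may be the better fit.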

Also, remember to close your files, or use the with keyword to close them automatically.
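Putting the pieces together, one possible version of the question's loop might look like the sketch below. The function names replace_with_first_match and main are hypothetical, and writing one word per line to newfile.txt is an assumption about the desired output format.

```python
def replace_with_first_match(check_lines, orig_lines):
    """Replace each stem with the first word containing it, else keep the stem."""
    # Materialise the words once, so they can be scanned again for every stem.
    orig_words = [w.lower() for line in orig_lines for w in line.split()]
    result = []
    for line in check_lines:
        stem = line.strip()  # drop the trailing newline
        if not stem:
            continue  # skip blank lines
        # First word that contains the stem, or the stem itself if none does.
        result.append(next((w for w in orig_words if stem in w), stem))
    return result

def main(check_path="check.txt", orig_path="orig.txt", out_path="newfile.txt"):
    # The with statement closes all three files, even if an error occurs.
    with open(check_path) as f, open(orig_path) as orig, open(out_path, "w") as new:
        for word in replace_with_first_match(f, orig):
            new.write(word + "\n")
```

Calling main() with the files from the question in the working directory writes the result to newfile.txt. The matching logic is kept in a separate function that takes any iterable of lines, so it can be tried out on plain lists as well.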
