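I have two files, check.txt and orig.txt. I want to check every word in check.txt and see whether it matches any word in orig.txt. If there is a match, the code should replace that word with the first match; otherwise it should leave the word unchanged. But somehow it does not work as required. Please help.
check.txt looks like this: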
ukrain
troop
force
and orig.txt looks like this:
ukraine cnn should stop pretending & announce: we will not report news while it reflects bad on obama @bostonglobe @crowleycnn @hardball
rt @cbcnews: breaking: .@vice journalist @simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou
russia 'outraged' at deadly shootout in east #ukraine - moscow:... http://t.co/nqim7uk7zg
#groundtroops #russianpresidentvladimirputin
f = open('check.txt','r')
orig = open('orig.txt','r')
new = open('newfile.txt','w')
for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()
            if word in word2:
                word = word2
            else:
                print('not found')
    new.write(word)
Answer 0 (score: 1)
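There are two problems with your code:
1. When you iterate over the words in f, each word still ends with a newline character, so your in check does not work.
2. You want to iterate over orig for each word in f, but files are iterators, so orig is exhausted after the first word from f.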
You can fix these by doing word = word.strip() and orig = list(orig), or you could try something like this:
# get all stemmed words
stemmed = [line.strip() for line in f]
# set of lowercased original words
original = set(word.lower() for line in orig for word in line.split())
# map stemmed words to unstemmed words
unstemmed = {word: None for word in stemmed}
# find original words for word stems in map
for stem in unstemmed:
    for word in original:
        if stem in word:
            unstemmed[stem] = word
print(unstemmed)
Or shorter (without the final double loop), using difflib, as suggested in the comments:
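import difflib
# each value is a list holding at most one close match (empty if none is close enough)
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}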
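To see how the difflib variant behaves, here is a small self-contained sketch. The sample words are taken from the question; the helper name closest is made up for illustration. Note that get_close_matches ranks by overall similarity (default cutoff 0.6) rather than by substring containment, so a short stem buried in a long hashtag may not be matched at all.

```python
import difflib

# Sample data borrowed from the question (illustrative only)
stemmed = ["ukrain", "troop", "force"]
original = {"ukraine", "#ukraine", "#groundtroops", "russia", "cnn"}

def closest(stem, candidates):
    """Hypothetical helper: best close match for stem, or stem itself if none."""
    matches = difflib.get_close_matches(stem, candidates, n=1)
    return matches[0] if matches else stem

unstemmed = {stem: closest(stem, original) for stem in stemmed}
print(unstemmed)
```

With this data, "ukrain" maps to "ukraine", while "troop" stays as it is because "#groundtroops" is too dissimilar overall to pass the default cutoff. If substring matches matter more than similarity, the explicit loop is probably the safer choice.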
Also, remember to close your files, or use the with keyword to close them automatically.
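Putting the two minimal fixes and the with keyword together, a version of the original loop might look like the sketch below. The function name replace_stems and its parameters are assumptions made for illustration, and it writes one word per line, which is a guess at the desired output format.

```python
def replace_stems(check_path, orig_path, out_path):
    """Hypothetical helper: replace each stem by the first original word containing it."""
    # Read all original words into a list once, so they can be scanned repeatedly
    with open(orig_path) as orig:
        words = [w.lower() for line in orig for w in line.split()]

    with open(check_path) as check, open(out_path, "w") as out:
        for line in check:
            stem = line.strip()  # drop the trailing newline before comparing
            if not stem:
                continue  # skip blank lines
            # First original word that contains the stem, else keep the stem
            match = next((w for w in words if stem in w), stem)
            out.write(match + "\n")
```

Calling replace_stems('check.txt', 'orig.txt', 'newfile.txt') would then close all three files automatically, even if an error occurs partway through.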