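In my system design, I have reached the step of comparing multiple .txt files.
Goal: taking one .txt file at a time, for each word in it I have to check whether that word also occurs in any of the other .txt files. If it does, I have to remove it from all the .txt files, including the file I am iterating over for the comparison.
To be clear, the rest of the words in every file must be kept. (I only need to remove the duplicated words from the respective .txt files.)
How can I do this?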
.txt file format:
word 1
word 2
word 3
.
.
After the operation, the file should look like this:
word 1
(removed, say this word occurred in other files) [It should be removed from all the .txt files]
word 3
.
.
Answer 0 (score: 1)
Here is a solution that loads each file into a set, iterates over the sets, finds the duplicates, and removes them.
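FILES = ['a.txt', 'b.txt', 'c.txt']
FILE_WORDS = []        # one set of words per file, in the same order as FILES
WORDS_INDEX = dict()   # file name -> {word: line index of that word in the file}

# Load every file into a set and remember each word's line index.
for txt_file in FILES:
    WORDS_INDEX[txt_file] = {}
    FILE_WORDS.append(set())
    with open(txt_file, 'r') as f:
        ordered_words = [(w.strip(), idx,) for idx, w in enumerate(f.readlines())]
    for word_tuple in ordered_words:
        WORDS_INDEX[txt_file][word_tuple[0]] = word_tuple[1]
        FILE_WORDS[-1].add(word_tuple[0])

# Find words shared between files. A shared word is dropped from the other
# files immediately and recorded for later removal from the current file,
# because a set must not be modified while it is being iterated.
words_to_remove = set()
for idx, set_of_words in enumerate(FILE_WORDS):
    for word in set_of_words:
        for offset in range(0, len(FILE_WORDS)):
            if offset != idx:
                if word in FILE_WORDS[offset]:
                    FILE_WORDS[offset].remove(word)
                    words_to_remove.add((idx, word))

# Remove the recorded words from the files in which they were first seen.
for entry in words_to_remove:
    FILE_WORDS[entry[0]].remove(entry[1])

# Report what is left in each file (set iteration order is arbitrary).
for idx, set_of_words in enumerate(FILE_WORDS):
    print('The words left in file {} are:'.format(FILES[idx]))
    for word in set_of_words:
        print('\tWord "{}" is in index {}'.format(word, WORDS_INDEX[FILES[idx]][word]))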
a.txt
zoo
gun
apple
b.txt
zoo
desk
apple
c.txt
dog
tv
home
desk
apple
Output
The words left in file a.txt are:
	Word "gun" is in index 1
The words left in file b.txt are:
The words left in file c.txt are:
	Word "tv" is in index 1
	Word "home" is in index 2
	Word "dog" is in index 0
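The code above only prints the surviving words; it does not write anything back to disk, which the question asks for. One possible way to do that is sketched below. It is a different technique from the set-based loop above: it counts, with `collections.Counter`, how many files each word appears in and keeps only the words that appear in exactly one file, preserving their original order. The function name `remove_shared_words`, the UTF-8 encoding, and the choice to skip blank lines are assumptions made for illustration, not part of the original answer.

```python
from collections import Counter
from pathlib import Path


def remove_shared_words(paths, encoding="utf-8"):
    """Rewrite each file in place, dropping words found in more than one file.

    Hypothetical helper: returns {path: list of words kept in that file}.
    """
    # Read each file as an ordered list of stripped, non-empty lines.
    contents = {}
    for path in paths:
        lines = Path(path).read_text(encoding=encoding).splitlines()
        contents[path] = [line.strip() for line in lines if line.strip()]

    # Count in how many distinct files each word occurs. set(words) ensures
    # that a word repeated inside a single file is counted only once.
    file_counts = Counter(
        word for words in contents.values() for word in set(words)
    )

    kept_by_file = {}
    for path, words in contents.items():
        # Keep only the words that are unique to this file, in original order.
        kept = [word for word in words if file_counts[word] == 1]
        # This overwrites the file; back up the originals first if needed.
        Path(path).write_text(
            "".join(word + "\n" for word in kept), encoding=encoding
        )
        kept_by_file[path] = kept
    return kept_by_file
```

With the sample files above, calling `remove_shared_words(['a.txt', 'b.txt', 'c.txt'])` should leave a.txt containing only gun, b.txt empty, and c.txt containing dog, tv and home, in that order.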