Question

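In my system design, I have reached the step of comparing multiple .txt files.

Goal: taking each word of one .txt file at a time, I have to check whether that word occurs in any of the remaining .txt files. If it does, I have to remove it from all the .txt files, including the file I am iterating over for the comparison.

To be clear, I have to preserve the rest of the word content in all the files. (I only need to remove the matching words from the respective .txt files.)

How can I do this?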

.txt file format

word 1
word 2
word 3
.
.

After the operation, the resulting file should look like this:

word 1
(removed, say this word occurred in other files) [It should be removed from all the .txt files]
word 3
.
.

Answer 1

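Here is a solution that loads each file into a set, iterates over the sets, finds the words that occur in more than one file, and removes them.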

FILES = ['a.txt', 'b.txt', 'c.txt']

# One set of words per file, in the same order as FILES.
FILE_WORDS = []

# Maps each file name to {word: line index of that word in the file}.
WORDS_INDEX = dict()

for txt_file in FILES:
    WORDS_INDEX[txt_file] = {}
    FILE_WORDS.append(set())
    with open(txt_file, 'r') as f:
        for idx, line in enumerate(f):
            word = line.strip()
            WORDS_INDEX[txt_file][word] = idx
            FILE_WORDS[-1].add(word)

# Remove each shared word from the other files right away, and record
# (file index, word) so it can be removed from the current file afterwards;
# a set must not be modified while it is being iterated over.
words_to_remove = set()
for idx, set_of_words in enumerate(FILE_WORDS):
    for word in set_of_words:
        for offset in range(len(FILE_WORDS)):
            if offset != idx and word in FILE_WORDS[offset]:
                FILE_WORDS[offset].remove(word)
                words_to_remove.add((idx, word))

for file_idx, word in words_to_remove:
    FILE_WORDS[file_idx].remove(word)

for idx, set_of_words in enumerate(FILE_WORDS):
    print('The words left in file {} are:'.format(FILES[idx]))
    for word in set_of_words:
        print('\tWord "{}" is in index {}'.format(word, WORDS_INDEX[FILES[idx]][word]))
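
The duplicate-removal loop above could also be expressed with plain set operations. The following is only a sketch of that idea on in-memory sets, not part of the solution itself; the helper name `drop_shared` is made up for illustration, and the sample sets simply mirror the three example files.

```python
from itertools import combinations


def drop_shared(word_sets):
    """Return new sets with every word that occurs in two or more sets removed."""
    shared = set()
    # A word is shared if it lies in the intersection of any pair of sets.
    for left, right in combinations(word_sets, 2):
        shared |= left & right
    # The input sets are left untouched; new sets are returned.
    return [words - shared for words in word_sets]


sets = [
    {'zoo', 'gun', 'apple'},
    {'zoo', 'desk', 'apple'},
    {'dog', 'tv', 'home', 'desk', 'apple'},
]
print(drop_shared(sets))
```

Because sets are unordered, the printed order of the words inside each set may vary between runs.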

a.txt

zoo
gun
apple

b.txt

zoo
desk
apple

c.txt

dog
tv
home
desk
apple

Output

The words left in file a.txt are:
    Word "gun" is in index 1
The words left in file b.txt are:
The words left in file c.txt are:
    Word "tv" is in index 1
    Word "home" is in index 2
    Word "dog" is in index 0
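
The code above only prints the remaining words, while the question asks for the files themselves to be changed. Below is a hedged sketch of one way to write the result back to disk while keeping the original word order. The function names `read_words` and `remove_shared_words` are hypothetical, it assumes one word per line, and it drops blank lines, which is an assumption not stated in the question. It overwrites the files in place, so try it on copies first.

```python
from collections import Counter
from pathlib import Path


def read_words(path):
    """Return the stripped, non-empty lines of a file, in order."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def remove_shared_words(paths):
    """Rewrite each file in place, dropping words that occur in more than one file."""
    words_by_file = {path: read_words(path) for path in paths}

    # Count in how many distinct files each word occurs; set() ignores
    # repeats inside a single file.
    file_counts = Counter()
    for words in words_by_file.values():
        file_counts.update(set(words))

    kept_by_file = {}
    for path, words in words_by_file.items():
        # Keep only the words that occur in exactly one file.
        kept = [w for w in words if file_counts[w] == 1]
        # Write the surviving words back, one per line, in original order.
        Path(path).write_text(''.join(w + '\n' for w in kept))
        kept_by_file[path] = kept
    return kept_by_file
```

Called as `remove_shared_words(['a.txt', 'b.txt', 'c.txt'])` on the sample files, this should leave a.txt with gun, b.txt empty, and c.txt with dog, tv and home, matching the output shown above.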
