Comparing multiple text files

Time: 2019-03-01 18:55:33

Tags: python

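In my system design, I have reached the step of comparing multiple .txt files.

Goal: taking one .txt file at a time, for each word in it I have to check whether that word also appears in any of the remaining .txt files. If it does, I have to remove it from all of the .txt files, including the file I am currently iterating over for the comparison.

To make myself clearer: the remaining words in every file must be kept. (I only need to remove the shared words from the respective .txt files.)

How can I do this?

.txt file format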

word 1
word 2
word 3
.
.

After the operation, the resulting file should look like this:

word 1
(removed, say this word occurred in other files) [It should be removed from all the .txt files]
word 3
.
.

1 answer:

Answer 0 (score: 1)

Here is a solution that loads the files into sets, iterates over the sets, finds the duplicates, and removes them.

FILES = ['a.txt', 'b.txt', 'c.txt']

# FILE_WORDS[i] is the set of words found in FILES[i]
FILE_WORDS = []

# WORDS_INDEX[file_name][word] is the line index of the word in that file
WORDS_INDEX = dict()

# Load every file into a set and remember the line index of each word
for txt_file in FILES:
    WORDS_INDEX[txt_file] = {}
    FILE_WORDS.append(set())
    with open(txt_file, 'r') as f:
        for idx, line in enumerate(f):
            word = line.strip()
            WORDS_INDEX[txt_file][word] = idx
            FILE_WORDS[-1].add(word)

# Find the words that appear in more than one file
words_to_remove = set()
for idx, set_of_words in enumerate(FILE_WORDS):
    for word in set_of_words:
        for offset in range(len(FILE_WORDS)):
            if offset != idx and word in FILE_WORDS[offset]:
                # Remove the duplicate from the other file right away
                FILE_WORDS[offset].remove(word)
                # Defer removal from the current set, which is being iterated
                words_to_remove.add((idx, word))

# Remove the deferred entries from the file in which each was first seen
for file_idx, word in words_to_remove:
    FILE_WORDS[file_idx].remove(word)

# Report what is left in each file
for idx, set_of_words in enumerate(FILE_WORDS):
    print('The words left in file {} are:'.format(FILES[idx]))
    for word in set_of_words:
        print('\tWord "{}" is in index {}'.format(word, WORDS_INDEX[FILES[idx]][word]))

a.txt

zoo
gun
apple

b.txt

zoo
desk
apple

c.txt

dog
tv
home
desk
apple

Output

The words left in file a.txt are:
    Word "gun" is in index 1
The words left in file b.txt are:
The words left in file c.txt are:
    Word "tv" is in index 1
    Word "home" is in index 2
    Word "dog" is in index 0
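
The answer above prints the remaining words but does not write them back, while the question asks for the shared words to be removed from the files themselves. The following is a minimal sketch of one way to do that, not the answerer's method: it counts in how many files each word occurs with collections.Counter and keeps only the words that occur in exactly one file, preserving the original line order. The names remove_shared_words and rewrite_files are hypothetical, and the sketch assumes one word per line and that blank lines can be dropped.

```python
from collections import Counter
from pathlib import Path


def remove_shared_words(files):
    """Return a copy of `files` without any word that occurs in more than one file.

    `files` maps a file name to its list of words (one per line, in order).
    """
    # Count in how many distinct files each word appears.
    file_count = Counter()
    for words in files.values():
        file_count.update(set(words))

    # Keep only the words found in a single file, preserving line order.
    return {
        name: [w for w in words if file_count[w] == 1]
        for name, words in files.items()
    }


def rewrite_files(paths):
    """Hypothetical helper: apply remove_shared_words to files on disk, in place."""
    # Assumption: one word per line; blank lines are dropped.
    contents = {
        p: [line.strip() for line in Path(p).read_text().splitlines() if line.strip()]
        for p in paths
    }
    for p, words in remove_shared_words(contents).items():
        Path(p).write_text("".join(w + "\n" for w in words))


# The sample data from the answer, held in memory instead of on disk.
sample = {
    "a.txt": ["zoo", "gun", "apple"],
    "b.txt": ["zoo", "desk", "apple"],
    "c.txt": ["dog", "tv", "home", "desk", "apple"],
}
result = remove_shared_words(sample)
print(result)
```

Calling rewrite_files(['a.txt', 'b.txt', 'c.txt']) would overwrite the three files, so it is worth trying on copies first. A word repeated inside a single file is kept in this sketch, because it is counted once per file.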