Question

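I have two text documents that contain mostly the same words, with a few exceptions. How can I find the words in document2 that do not appear anywhere in document1 and print them? For example: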

Document 1: "Hello how are you"

Document 2: "Hi how are you today John"

Desired output: "Hi today John"

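Edit: I want to print only the words that exist in document2 and are not found anywhere in document1. I do not want to print the words the two documents have in common.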

I wrote this code, but I think it finds the matches between the two text files, which is not what I want it to do:

doc1 = open(r"K:\System Files\Desktop\document1.txt", "r")
doc2 = open(r"K:\System Files\Desktop\document2.txt", "r")

list1 = []
list2 = []

for i in doc1:  # Strips the newline after each word
    list1.append(i.rstrip("\n"))
for i in doc2:
    list2.append(i.rstrip("\n"))

for i in list1:
    for j in list2:
        if i == j:
            print(i)

Answer 1

If you are not worried about the order of the words, you can do this with sets, as follows:

import re

def get_words(filename):
    with open(filename, 'r') as f_input:
        return set(w.lower() for w in re.findall(r'(\w+)', f_input.read()))

words1 = get_words('document1.txt')
words2 = get_words('document2.txt')

print(words2 - words1)

On Python 3 this displays the following (sets are unordered, so the order may vary):

{'john', 'hi', 'today'}

Using - on two sets gives you their difference: the elements of the first set that are not in the second.
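If the order of the words does matter, as the desired output in the question suggests, one possible variation is to keep a set only for the membership test and walk the second text in order. This is a minimal sketch, not part of the original answer: the function name `words_only_in_second` is hypothetical, it takes strings rather than filenames so it can be tried without any files, and the two sample sentences are stand-ins for the contents of the two documents.

```python
import re

def words_only_in_second(text1, text2):
    """Return the words of text2 that never occur in text1, in first-seen order.

    Hypothetical helper: comparison is case-insensitive, and a word that
    repeats in text2 is reported only once.
    """
    seen = set(w.lower() for w in re.findall(r'\w+', text1))
    result = []
    for word in re.findall(r'\w+', text2):
        key = word.lower()
        if key not in seen:
            seen.add(key)        # also suppresses repeats within text2
            result.append(word)  # keep the original capitalisation
    return result

# Sample sentences standing in for the two documents
print(' '.join(words_only_in_second("Hello how are you",
                                    "Hi how are you today John")))
# prints: Hi today John
```

To use it with files, read each file with `f.read()` and pass the two strings in. The set keeps each lookup cheap, while the list preserves the order in which the words appear in the second document.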
