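Finding duplicate words in two files

Time: 2016-01-04 10:22:59

Tags: python file duplicates

I have two text files and need to check them for duplicate words. Is there a more concise way to do it than this code?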

file1 = set(line.strip() for line in open('/home/user1/file1.txt'))
file2 = set(line.strip() for line in open('/home/user1/file2.txt'))

for line in file1 & file2:
    if line:
        print(line)

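3 answers:

Answer 0 (score: 2)

You can write more concise code, but more importantly, you don't need to create two sets. You can use set.intersection, which accepts any iterable, so your code will work on larger data sets and run faster: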

with open('/home/user1/file1.txt') as f1, open('/home/user1/file2.txt') as f2:
    # Build one set from file 1, then intersect it with the stripped lines of file 2.
    for line in set(map(str.rstrip, f1)).intersection(map(str.rstrip, f2)):
        print(line)

For Python 2, use itertools.imap:

from itertools import imap
with open('/home/user1/file1.txt') as f1, open('/home/user1/file2.txt') as f2:
    # imap is lazy in Python 2, so file 2 is never materialised as a list.
    for line in set(imap(str.rstrip, f1)).intersection(imap(str.rstrip, f2)):
        print(line)

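This creates a single set and then iterates over the iterable you pass in, i.e. the str.rstrip-ped lines of file 2, as opposed to first creating two full sets of lines and then intersecting them.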
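
As a rough sketch of the single-set idea, the snippet below wraps it in a function. The name `common_lines` and its path arguments are hypothetical and not part of the original answer; discarding the empty string is borrowed from the `if line:` check in the question.

```python
def common_lines(path1, path2):
    """Return the non-empty lines that appear in both files (hypothetical helper)."""
    with open(path1) as f1, open(path2) as f2:
        # Only the lines of the first file are stored in a set.
        seen = set(map(str.rstrip, f1))
        # intersection() accepts any iterable, so the second file is
        # consumed line by line without building a second set first.
        common = seen.intersection(map(str.rstrip, f2))
    # Blank lines strip down to "", which is not a useful match.
    common.discard("")
    return common
```

Calling `common_lines('/home/user1/file1.txt', '/home/user1/file2.txt')` would return a set that you can loop over and print, as in the code above.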

Answer 1 (score: 0)

This version is shorter and closes both files after use:

with open('/home/user1/file1.txt') as file1, open('/home/user1/file2.txt') as file2:
    for line in set(line.strip() for line in file1) & set(line.strip() for line in file2):
        if line: 
            print(line)

A variation that builds only one set:

with open('/home/user1/file1.txt') as file1, open('/home/user1/file2.txt') as file2:
    for line in set(line.strip() for line in file1).intersection(
            line.strip() for line in file2):
        if line:
            print(line)

Answer 2 (score: 0)

Even shorter:

with open('/home/user/file1.txt') as file1, open('/home/user/file2.txt') as file2:
    # split() breaks on any whitespace, so this compares words rather than whole lines.
    print("".join([word+"\n" for word in set(file1.read().split()) & set(file2.read().split())]))
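
Unlike the line-based answers, this one splits each file on whitespace, so it matches individual words. A minimal sketch of that behaviour follows, assuming a hypothetical helper named `common_words`; sorting the result is an addition for stable output and is not in the original answer.

```python
def common_words(path1, path2):
    """Return a sorted list of words found in both files (hypothetical helper)."""
    with open(path1) as file1, open(path2) as file2:
        # read().split() splits on any whitespace, including newlines,
        # so a word is matched wherever it appears on a line.
        words1 = set(file1.read().split())
        words2 = set(file2.read().split())
    # Sets are unordered; sort so the output order is predictable.
    return sorted(words1 & words2)
```

Both files are read into memory in full, which is fine for small inputs but gives up the memory saving of the single-set approach.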