Print files that share the exact same content

Time: 2018-03-29 19:16:47

Tags: python itertools hashlib listdir

My directory has hundreds of files, some of which have different names but identical contents. I pair up the files and do the following:

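import os
import itertools
import hashlib

def check(path):
    # Return the MD5 hex digest of the file's raw bytes
    data_check = hashlib.md5()
    with open(path, "rb") as f:
        data_check.update(f.read())
    return data_check.hexdigest()

def match_check(c1, c2):
    return check(c1) == check(c2)

folder = input()
# os.listdir returns bare names, so join them with the folder
directory = [os.path.join(folder, name) for name in os.listdir(folder)]
for collection1, collection2 in itertools.combinations(directory, 2):
    if match_check(collection1, collection2):
        print(collection1, collection2)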

1 answer:

Answer 0 (score: 0)

You can use a dict with the MD5 sum as the key. For example:

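files = {}

# In the loop (data is the path of the current file):
  with open(data, "rb") as f:
    # Use the hex digest string as the key; the md5 object itself
    # would never compare equal to another one
    md5sum = hashlib.md5(f.read()).hexdigest()
  if md5sum in files:
    # A file already exists for this MD5 sum, append the file
    files[md5sum].append(data)
  else:
    # First file with this MD5 sum
    files[md5sum] = [data]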

Then you can list the dict values whose key is shared by more than one file. For example:

for md5sum, paths in files.items():
  if len(paths) > 1:
    # More than one file with the same MD5 sum
    # Do something, e.g. print the group
    print(md5sum, paths)
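
Putting the two fragments together, a minimal self-contained sketch could look like the following. The function names `md5_of_file` and `find_duplicates` and the `chunk_size` parameter are my own illustration and do not come from the question. The sketch also assumes that only regular files directly inside the folder should be compared. It reads each file in chunks so that large files do not have to fit in memory.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Map each MD5 digest to the paths in `folder` that share it."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # assumption: skip subdirectories
            groups[md5_of_file(path)].append(path)
    # Keep only the digests shared by more than one file
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates(input())` and printing each value of the returned dict lists the groups of files with identical content. Each file is hashed only once here, whereas comparing every pair with `itertools.combinations` rehashes the same files many times.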