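I have a directory with hundreds of files, some of which have different names but identical content. I paired the files up and did the following: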
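import os
import itertools
import hashlib

def check(data):
    # Hash the raw bytes of the file; binary mode avoids decoding errors.
    data_check = hashlib.md5()
    with open(data, "rb") as f:
        data_check.update(f.read())
    return data_check.hexdigest()

def match_check(c1, c2):
    return check(c1) == check(c2)

root = input()
# os.listdir returns bare names, so join each one with the directory.
directory = [os.path.join(root, name) for name in os.listdir(root)]
for collection1, collection2 in itertools.combinations(directory, 2):
    match_check(collection1, collection2)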
Answer 0 (score: 0)
You can use a dict with the MD5 sum as the key. For example:
files = {}
# In the loop over file paths (data is the current path):
with open(data, "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
if digest in files:
    # A file already exists for this MD5 sum, append the file
    files[digest].append(data)
else:
    # First file with this MD5 sum
    files[digest] = [data]
Then you can list the dict values that hold more than one file under the same key. For example:
for digest, l in files.items():
    if len(l) > 1:
        # More than one file with the same MD5 sum
        # Do something, e.g. report the duplicates
        print(digest, l)
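Putting the two snippets together, here is a minimal self-contained sketch of the dict-based approach. The helper names `file_md5` and `find_duplicates` and the `chunk_size` parameter are my own and not part of the question. It assumes you only care about regular files directly inside one directory, and it reads each file in chunks so that large files are not loaded into memory at once.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Map each MD5 digest to the paths sharing it, keeping groups of 2 or more."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip subdirectories
            groups[file_md5(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates(some_directory)` hashes every file exactly once, whereas comparing all pairs with `itertools.combinations` rehashes each file for every pair it appears in. MD5 is fine for spotting accidental duplicates; if the files may be crafted to collide, swapping in `hashlib.sha256` is a one-line change.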