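I am iterating through the files in a folder and all of its subdirectories, then computing a hash of each file. Seems pretty basic.
In this example there are 1036 files in total. When I run it for real, there will be about 75,000 files.
The code flies through the iterations as expected until the file counter reaches the late 900s.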
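The issue is easily isolated to the binary read. The size of the chunk read (1024 vs 4096 vs 65536 bytes) does make a difference.
If I run from the command line, the code gets through the first 950 files quickly (about 2 seconds), then pauses on various files (not always the same ones) and takes about 20 seconds on average to finish the last 75 or so files.
In PyCharm it takes about 2:20 (seriously) to iterate through the 1036 files.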
This code is very simple. What is the problem?
(Please pardon the unorthodox, not very Pythonic naming.)
import os
import hashlib
from datetime import datetime


def hash_file(filename):
    # make a hash object
    # the chance of a hash collision with sha1 or MD5 is minimal but I'm using sha256 because memory is plentiful
    h = hashlib.sha256()
    try:
        # open file for reading in binary mode - binary mode is a must
        # (the with statement closes the file, so no explicit close() is needed)
        with open(filename, 'rb') as file:
            # loop till the end of the file
            fileChunk = None
            while fileChunk != b'':
                # read only 1024 bytes at a time
                #fileChunk = file.read(1024)
                # read only 4096 bytes at a time
                #fileChunk = file.read(4096)
                # read only 65536 bytes at a time
                fileChunk = file.read(65536)
                h.update(fileChunk)
            # the entire absolute path and file name
            print(os.path.realpath(file.name))
            # e.g. "/Users/testuser/PycharmProjects/test2/Examples/brianshirt_1.jpg"
        # return the hex representation of digest
        return h.hexdigest()
    except IOError:
        print("\n\nfile %s could not be opened" % filename)
        return "Error"


startTime = datetime.now()
fileCounter = 0
#rootDir = '.'
rootDir = '/Users/testuser/Downloads'
for dirName, subdirList, fileList in os.walk(rootDir, topdown=False):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        fileCounter += 1
        temp1 = str(fname) + ":" + str(hash_file(os.path.join(dirName, fname))) + " counter: " + str(fileCounter)
endTime = datetime.now()
totalTime = endTime - startTime
print("Count of files hashed: " + str(fileCounter))
print('Scanning Completed in: ' + str(totalTime))
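
To narrow down where the pauses come from, it may help to record the size and elapsed time of each file rather than only the overall total. The following is a minimal diagnostic sketch, not a fix: the names `timed_hash` and `slowest_files` are hypothetical helpers introduced here, and it assumes that a file which cannot be read can simply be skipped.

```python
import hashlib
import os
import time


def timed_hash(path, chunk_size=65536):
    """Hash one file with SHA-256; return (hex digest, size in bytes, seconds taken)."""
    h = hashlib.sha256()
    size = 0
    start = time.perf_counter()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns b'' at end of file
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size, time.perf_counter() - start


def slowest_files(root_dir, top_n=10):
    """Walk root_dir and return up to top_n (seconds, size, path) tuples, slowest first."""
    timings = []
    for dir_name, _subdirs, file_list in os.walk(root_dir):
        for fname in file_list:
            path = os.path.join(dir_name, fname)
            try:
                _digest, size, seconds = timed_hash(path)
            except OSError:
                # assumption: unreadable files are skipped in this diagnostic
                continue
            timings.append((seconds, size, path))
    return sorted(timings, reverse=True)[:top_n]
```

Printing the tuples returned by `slowest_files(rootDir)` would show whether the slow iterations coincide with unusually large files or are spread across files of ordinary size.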