Question

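I iterate over the files in a folder and all of its subdirectories, then compute a hash of each file. Seems pretty basic.

In this example there are 1036 files in total. When I run it for real, there will be around 75,000 files.

The code flies through the iteration as expected until the file counter reaches the late 900s.

The problem is easily isolated to the binary read. The chunk size being read (1024 vs 4096 vs 65536) does make a difference.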
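
One way to check how much the chunk size matters is to time the same file at each size. The following is a minimal sketch, not the code from the post: `hash_with_chunk_size` and `compare_chunk_sizes` are hypothetical helper names, and the 8 MiB file of random bytes is an assumption chosen only for the demo.

```python
import hashlib
import os
import tempfile
import time


def hash_with_chunk_size(path, chunk_size):
    """Return (hex digest, elapsed seconds) for one pass over the file."""
    h = hashlib.sha256()
    start = time.perf_counter()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops at the first empty read (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest(), time.perf_counter() - start


def compare_chunk_sizes(path, sizes=(1024, 4096, 65536)):
    """Hash the same file once per chunk size and collect the results."""
    return {size: hash_with_chunk_size(path, size) for size in sizes}


if __name__ == '__main__':
    # Hypothetical sample data: an 8 MiB temporary file of random bytes
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for size, (digest, seconds) in compare_chunk_sizes(tmp.name).items():
            print('%6d bytes: %.4f s  %s' % (size, seconds, digest[:12]))
    finally:
        os.remove(tmp.name)
```

The digest should be identical for every chunk size; only the elapsed time should vary. Repeated reads of one file are usually served from the operating system's cache, so timings taken this way mostly reflect per-call overhead rather than disk speed.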

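If I run it from the command line, the code gets through the first 950 or so files quickly (about 2 seconds), then stalls on various files (not always the same ones) and takes about 20 seconds on average to finish the remaining 75 or so files.

In PyCharm, iterating over the 1036 files takes about 2:20 (seriously).

This code is pretty simple. What is the problem?

(Please forgive the unorthodox, non-Pythonic naming conventions.)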

import os
import hashlib
from datetime import datetime


def hash_file(filename):
    # make a hash object
    # the chance of a hash collision with sha1 or MD5 is minimal but I'm using sha256 because memory is plentiful
    h = hashlib.sha256()

    try:
        # open file for reading in binary mode - binary mode is a must
        with open(filename,'rb') as file:
            # loop till the end of the file
            fileChunk = None
            while fileChunk != b'':
                # read only 1024 bytes at a time
                #fileChunk = file.read(1024)
                # read only 4096 bytes at a time
                #fileChunk = file.read(4096)
                # read only 65536 bytes at a time
                fileChunk = file.read(65536)

                h.update(fileChunk)

        # the entire absolute path and file name
        print(os.path.realpath(file.name))
        # e.g. "/Users/testuser/PycharmProjects/test2/Examples/brianshirt_1.jpg"

        # return the hex representation of digest
        return h.hexdigest()

    except IOError as ioe:
        print("\n\nfile %s could not be opened" % filename)
        return "Error"

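    # no finally/file.close() here: the with statement already closes the file,
    # and `file` would be unbound (raising NameError) if open() itself had failed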



startTime = datetime.now()
fileCounter = 0
#rootDir = '.'
rootDir = '/Users/testuser/Downloads'
for dirName, subdirList, fileList in os.walk(rootDir, topdown=False):
    print('Found directory: %s' % dirName)

    for fname in fileList:
        fileCounter += 1
        temp1 = str(fname) + ":" + str(hash_file(os.path.join(dirName, fname))) + " counter: " + str(fileCounter)


endTime = datetime.now()

totalTime = endTime - startTime
print("Count of files hashed: " + str(fileCounter))
print('Scanning Completed in: ' + str(totalTime))
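
To see whether the stalls line up with particular files, it may help to record the size and hashing time of each file and list the slowest ones. This is a minimal sketch under stated assumptions, not a diagnosis: `timed_hash` and `slowest_files` are hypothetical names, and the demo tree with one small and one larger file is made up for illustration.

```python
import hashlib
import os
import tempfile
import time


def timed_hash(path, chunk_size=65536):
    """Return (hex digest, size in bytes, elapsed seconds) for one file."""
    h = hashlib.sha256()
    size = 0
    start = time.perf_counter()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            size += len(chunk)
            h.update(chunk)
    return h.hexdigest(), size, time.perf_counter() - start


def slowest_files(root, top_n=10):
    """Walk root and return up to top_n (seconds, size, path) tuples, slowest first."""
    timings = []
    for dir_name, _subdirs, file_names in os.walk(root):
        for name in file_names:
            path = os.path.join(dir_name, name)
            try:
                _digest, size, seconds = timed_hash(path)
            except OSError:
                continue  # unreadable file: skipped in this sketch
            timings.append((seconds, size, path))
    return sorted(timings, reverse=True)[:top_n]


if __name__ == '__main__':
    # Hypothetical demo tree: one small file and one larger file
    with tempfile.TemporaryDirectory() as root:
        for name, n_bytes in (('small.bin', 1024), ('large.bin', 4 * 1024 * 1024)):
            with open(os.path.join(root, name), 'wb') as f:
                f.write(os.urandom(n_bytes))
        for seconds, size, path in slowest_files(root):
            print('%8.3f s  %12d bytes  %s' % (seconds, size, path))
```

Pointing `slowest_files` at the real directory (for example the `rootDir` used above) would show whether the slow files are simply the largest ones or whether small files are stalling too.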

Poor Python performance when reading files in a loop in binary mode

0 answers: