Poor Python performance when reading files in binary mode in a loop

Time: 2018-04-12 14:52:06

Tags: python performance

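I walk through the files in a folder and all of its subdirectories and compute a hash for each file. It seems pretty basic.

In this example there are 1036 files in total. When I run it for real, there will be around 75,000 files.

The code flies through the iterations as expected until the file counter gets into the late 900s.

The problem is easily isolated to the binary read. The chunk size used for the read (1024 vs 4096 vs 65536) does make a difference.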
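
To make the chunk-size comparison concrete, here is a minimal sketch of how the read could be timed in isolation for a single file. The helper names `hash_with_chunk_size` and `time_chunk_sizes` are hypothetical and not part of my script. The numbers depend heavily on the disk and on the OS file cache, so the first pass over a file may be much slower than later ones.

```python
import hashlib
import os
import tempfile
import time


def hash_with_chunk_size(path, chunk_size):
    """Return the SHA-256 hex digest of path, reading chunk_size bytes at a time."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops as soon as read() returns b'' (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def time_chunk_sizes(path, chunk_sizes=(1024, 4096, 65536)):
    """Return {chunk_size: (seconds, digest)} for each chunk size (hypothetical helper)."""
    results = {}
    for size in chunk_sizes:
        start = time.perf_counter()
        digest = hash_with_chunk_size(path, size)
        results[size] = (time.perf_counter() - start, digest)
    return results


if __name__ == '__main__':
    # Demo on a throwaway 8 MiB file; point this at one of the slow files instead.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for size, (seconds, digest) in time_chunk_sizes(tmp.name).items():
            print('%6d bytes/read: %.4f s  %s' % (size, seconds, digest[:12]))
    finally:
        os.remove(tmp.name)
```

The digest should be identical for every chunk size; only the elapsed time should vary.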

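If I run it from the command line, the code gets through roughly the first 950 files quickly (about 2 seconds), then pauses on various files (not always the same ones) and takes about 20 seconds on average to finish the remaining 75 or so files.

In PyCharm, iterating over the 1036 files takes about 2:20 (seriously).

The code is very simple. What is the problem?

(Please forgive the non-Pythonic naming conventions.)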

import os
import hashlib
from datetime import datetime


def hash_file(filename):
    # make a hash object
    # the chance of a hash collision with sha1 or MD5 is minimal but I'm using sha256 because memory is plentiful
    h = hashlib.sha256()

    try:
        # open file for reading in binary mode - binary mode is a must
        with open(filename, 'rb') as file:
            # read 65536 bytes at a time (1024 and 4096 were also tried)
            # and stop when read() returns b'' at the end of the file
            for fileChunk in iter(lambda: file.read(65536), b''):
                h.update(fileChunk)

        # the entire absolute path and file name
        print(os.path.realpath(filename))
        # e.g. "/Users/testuser/PycharmProjects/test2/Examples/brianshirt_1.jpg"

        # return the hex representation of digest
        return h.hexdigest()

    except IOError as ioe:
        print("\n\nfile %s could not be opened: %s" % (filename, ioe))
        return "Error"
    # no finally/file.close() here: the with block already closes the file,
    # and `file` would be unbound if open() itself had failed



startTime = datetime.now()
fileCounter = 0
#rootDir = '.'
rootDir = '/Users/testuser/Downloads'
for dirName, subdirList, fileList in os.walk(rootDir, topdown=False):
    print('Found directory: %s' % dirName)

    for fname in fileList:
        fileCounter += 1
        # os.path.join builds the path with the correct separator for the platform
        temp1 = fname + ":" + hash_file(os.path.join(dirName, fname)) + " counter: " + str(fileCounter)


endTime = datetime.now()

totalTime = endTime - startTime
print ("Count of files hashed: " + str(fileCounter))
print ('Scanning Completed in: '+ str(totalTime))
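
Since the pauses hit different files on each run, one way to narrow this down might be to record the size and elapsed time for every file and then look at the slowest ones. The sketch below is an assumption about how that could be done, not a diagnosis. `timed_hashes` and `slowest` are hypothetical names, and it uses the same chunked SHA-256 read as the script above.

```python
import hashlib
import os
import time


def timed_hashes(root_dir, chunk_size=65536):
    """Yield (path, size_in_bytes, seconds, digest) for each readable file under root_dir."""
    for dir_name, _subdirs, file_list in os.walk(root_dir, topdown=False):
        for fname in file_list:
            path = os.path.join(dir_name, fname)
            h = hashlib.sha256()
            start = time.perf_counter()
            try:
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(chunk_size), b''):
                        h.update(chunk)
                size = os.path.getsize(path)
            except OSError:
                # unreadable file (permissions, broken symlink, ...): skip it
                continue
            yield path, size, time.perf_counter() - start, h.hexdigest()


def slowest(records, n=10):
    """Return the n records with the largest elapsed time, slowest first."""
    return sorted(records, key=lambda r: r[2], reverse=True)[:n]
```

Printing `slowest(timed_hashes(rootDir))` would list the ten slowest files with their sizes. If those turn out to be the largest files, the pauses may simply be read time rather than anything in the loop itself.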

0 answers:

No answers