How do I report progress while processing a large file?

Date: 2012-02-29 18:05:33

Tags: python multithreading logging command-line-interface

How can I report, every five seconds, how much of the file has been processed? I think I need a thread, but how do I control it?

#!/usr/bin/env python
# -*- coding: utf8 -*-

import os
import sys
import logging
import hashlib

logger = logging.getLogger()
FORMAT = "%(asctime)s %(levelname)s: %(message)s"
logging.basicConfig(format=FORMAT, level=logging.DEBUG, datefmt="%H:%M:%S")

class fileScanner:
  readBytes = 0
  lastReadBytes = 0
  fileSize = 0  
  reportSeconds = 5

  def scanFile(self, filePath):
    self.readBytes = 0
    self.lastReadBytes = 0

    logging.getLogger()
    self.fileSize = os.path.getsize(filePath)

    with open(filePath, 'rb') as f:
      m = hashlib.sha512()
      while True:
        data = f.read(1024)
        if not data:
          break
        self.readBytes += len(data)
        m.update(data)
      return m.hexdigest()

  def reportProcess(self):
    logging.getLogger()
    percent = float((self.readBytes / self.fileSize) * 100)
    secAvg = (self.readBytes - self.lastReadBytes) / self.reportSeconds
    estimatedTime = (self.fileSize - self.readBytes) / secAvg
    logging.info("%s%% (%s / %s bytes) read in average of %s MB / sec. Estimated time left: %s seconds." % (percent, self.readBytes, self.fileSize, secAvg, estimatedTime))
    self.lastReadBytes = self.readBytes


if __name__ == "__main__":
  fs = fileScanner()
  hash = fs.scanFile('largefile.dat')

How do I start and stop reportProcess()?

Yes, I know the calculations in there may be wrong.

2 answers:

Answer 0 (score: 1)

You can just call reportProcess from inside the read loop every 5 seconds, e.g.

lastTime = time.time()  # requires `import time`
while True:
    data = f.read(1024)
    if not data:
        break
    self.readBytes += len(data)
    m.update(data)
    if time.time() - lastTime > 5:
        self.reportProcess()
        lastTime = time.time()

Unrelated: why are you using class-level attributes? They should normally be instance-level, e.g.

class FileScanner:
  def __init__(self):
      self.readBytes = 0
      self.lastReadBytes = 0

Answer 1 (score: 0)

Could you call reportProcess() inside the while loop of your scanFile() function? For example, call reportProcess() every x bytes read (add a condition to the while loop). Would that solve your problem?
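A sketch of that idea (the `scan_file` name, the callback signature, and the 1 MB threshold are illustrative, not from the answer): a standalone version of the question's hashing loop that invokes a callback roughly every `report_every` bytes.

```python
import hashlib
import os

def scan_file(path, report, report_every=1024 * 1024):
    """Hash `path` with SHA-512, calling report(read, total)
    roughly every `report_every` bytes read."""
    total = os.path.getsize(path)
    m = hashlib.sha512()
    read = 0
    next_report = report_every
    with open(path, 'rb') as f:
        while True:
            data = f.read(1024)
            if not data:
                break
            read += len(data)
            m.update(data)
            if read >= next_report:  # the "every x bytes" condition
                report(read, total)
                next_report += report_every
    return m.hexdigest()
```

Compared with the timer approach, this reports at byte-count intervals, so on a slow disk the reports may be far apart in wall-clock time.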