After reading a large file (e.g. 40 MB), the "pwd" command takes much longer

Time: 2019-07-02 13:33:04

Tags: python

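The following Python code runs in three phases:

  • Phase 1: call connect_machines() (which actually just runs the pwd command)

  • Phase 2: call get_machines() (which actually just reads a large file)

  • Phase 3: do the same thing as phase 1

Phase 3 takes much longer than phase 1.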

conten_big.txt is a file containing JSON data; its size is 39 MB.

When I run main(), end_time2 - start2 is 22.04 s, while end_time1 - start1 is 8.51 s.

When I comment out the machines_a = get_machines() line and run main() again, end_time1 - start1 is almost equal to end_time2 - start2.

import sys
import pdb
import os
import json
import time
import datetime
import logging
import commands
def get_logger(logger_name):
    """Configure logging to ./log/<logger_name>.log and return the logger."""
    logging.basicConfig(level = logging.INFO, \
            format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
            datefmt='%a, %d %b %Y %H:%M:%S',
            filename='./log/%s.log' % logger_name,
            filemode='w')
    logger = logging.getLogger(logger_name)
    return logger

logger = get_logger('test_log')

def get_machines():
    print 'get machines start'
    fp = open('./conten_big.txt', 'r')
    machines = fp.read()
    fp.close()
    machines = json.loads(machines)
    print 'get machines have finished',len(machines)
    return machines

def connect_machines(loop_count):
    for idex in range(0, loop_count):
        connect_port(idex)

def connect_port(idex):
    ret2 = 0
    cmd = 'pwd'
    start_time=datetime.datetime.now()
    (status, msg) = commands.getstatusoutput(cmd)
    end_time=datetime.datetime.now()
    cost = str(end_time-start_time)
    logger.info("[%d] --[%d] -- [%s] %s" % (idex, status, msg, cost))

def main(argv):
    """main """
    nowTime=datetime.datetime.now()
    print nowTime.strftime('%Y-%m-%d %H:%M:%S')
    machine_count = 5000

    logger.info("=====================>>>>")
    start1=datetime.datetime.now()
    print start1.strftime('%Y-%m-%d %H:%M:%S')
    connect_machines(machine_count)
    end_time1=datetime.datetime.now()
    print end_time1.strftime('%Y-%m-%d %H:%M:%S')
    logger.info("[%s] --- [%s] ---[%s]" % (end_time1, start1, end_time1 - start1))
    print end_time1, start1, end_time1 - start1
    #read one big file, eg. a file size 39M
    machines_a = get_machines()
    logger.info("=====================")

    time.sleep(30)

    start2=datetime.datetime.now()
    print start2.strftime('%Y-%m-%d %H:%M:%S')
    connect_machines(machine_count)
    end_time2=datetime.datetime.now()
    print end_time2.strftime('%Y-%m-%d %H:%M:%S')
    logger.info("[%s] --- [%s] ---[%s]" % (end_time2, start2, end_time2 - start2))
    print end_time2, start2, end_time2 - start2

if __name__ == '__main__':
    main(sys.argv)
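
The commands module used above exists only in Python 2. As a minimal sketch, assuming Python 3, the timed pwd loop could be written with subprocess.getstatusoutput instead. The helper name time_command_loop is made up for this illustration and is not part of the original code.

```python
import subprocess
import time


def time_command_loop(cmd, loop_count):
    """Run `cmd` loop_count times; return (elapsed_seconds, last_status)."""
    start = time.perf_counter()
    status = 0
    for _ in range(loop_count):
        # subprocess.getstatusoutput is the Python 3 counterpart of
        # commands.getstatusoutput: it runs the command through a shell
        # and returns (exit_code, output).
        status, _output = subprocess.getstatusoutput(cmd)
    return time.perf_counter() - start, status


if __name__ == "__main__":
    elapsed, status = time_command_loop("pwd", 100)
    print("100 runs of pwd: %.3f s (last status %d)" % (elapsed, status))
```

Calling this helper once before and once after get_machines() would give the same two measurements as main() above, without the per-call logging.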

1 Answer:

Answer 0 (score: 0)

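The program takes a long time because the file is large (~40 MB); as you said yourself, commenting out get_machines() greatly reduces the execution time.

Comparing end_time1 - start1 with end_time2 - start2 is not meaningful: a for loop of only 5000 iterations is much faster than reading a very large file, because processing that much binary data takes longer.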
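
In the question's code, both timed intervals wrap only connect_machines(), so one way to check whether the data kept in memory affects the later loop is to time the same loop before and after allocating a large object. The following is a rough sketch, assuming Python 3; the function names and the "ballast" list are hypothetical stand-ins for the parsed JSON, not the original data.

```python
import subprocess
import time


def time_spawns(count, cmd="pwd"):
    """Return the seconds spent running `cmd` `count` times via a shell."""
    start = time.perf_counter()
    for _ in range(count):
        subprocess.getstatusoutput(cmd)
    return time.perf_counter() - start


def compare_with_ballast(count, ballast_items, cmd="pwd"):
    """Time the same loop before and after allocating many live objects."""
    before = time_spawns(count, cmd)
    # Hypothetical stand-in for the parsed JSON: a large list of small dicts
    # that stays referenced while the second loop runs.
    ballast = [{"id": i, "name": "machine-%d" % i} for i in range(ballast_items)]
    after = time_spawns(count, cmd)
    return before, after, len(ballast)


if __name__ == "__main__":
    before, after, kept = compare_with_ballast(200, 1000000)
    print("before: %.3f s, after: %.3f s (%d objects kept alive)"
          % (before, after, kept))
```

If the second figure is clearly larger, that would suggest that starting a child process gets more expensive as the parent's memory grows (a fork has to duplicate the parent's page tables), rather than that the file read itself is being timed. The outcome depends on the operating system and the Python version, so no particular numbers should be expected.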