Processing multiple text files concurrently

Date: 2014-10-07 13:04:41

Tags: python multiprocessing gevent libevent

*Solved by using pool.map() instead of map_async() with multiprocessing.

Python 2.7 - How can I process multiple text files concurrently with gevent or multiprocessing, using the code below?

I have pasted both the gevent and the multiprocessing pool versions.

The log output shows that the files are being processed sequentially, and running 'lsof' on Linux confirms that only one file is being read at a time.

The files are stored on an enterprise-class disk shelf containing an array of Ultra320 drives.

I can open four files at a time with the following function, which just sleeps, but not when I try to open the files and step through them line by line. Does the 'for line in lh' loop somehow block the next file from being opened?

from time import sleep
from multiprocessing import Pool


def hold_open(log):
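    # Keep each file open for a minute so overlapping opens are visible with lsof.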
    with open(log) as fh:
        sleep(60)

pool = Pool(processes=4)
pool.map(hold_open, ['file1', 'file2', 'file3', 'file4'])
pool.close()  # close() must come before join(), or join() raises
pool.join()
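
For comparison, a minimal sketch of the line-by-line case (the read_lines worker here is made up for illustration; 'file1' through 'file4' stand in for real paths as in the snippet above). It logs when each worker opens and closes its file, so overlapping open timestamps would show whether the files really are opened concurrently:

import os
from time import time
from multiprocessing import Pool


def read_lines(log):
    # Placeholder worker: opens a file and drains it line by line,
    # logging open/close times per process.
    print("%d opened %s at %.3f" % (os.getpid(), log, time()))
    with open(log) as fh:
        for line in fh:
            pass
    print("%d closed %s at %.3f" % (os.getpid(), log, time()))


if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(read_lines, ['file1', 'file2', 'file3', 'file4'])
    pool.close()
    pool.join()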

What am I doing wrong, and how do I change the code to fix it?

2014-10-07 13:51:51,088 - __main__ - INFO - Found 23 files, duration: 0:00:00.000839
2014-10-07 13:51:51,088 - __main__ - INFO - Now analysing using 8 threads.....
2014-10-07 13:51:51,089 - __main__ - INFO - XSLog2014.05.15-16.40.01.txt - Analysing...
2014-10-07 13:51:51,471 - __main__ - INFO - XSLog2014.05.15-16.40.01.txt - Finished analysing 41943107 bytes duration: 0:00:00.381875
2014-10-07 13:51:51,471 - __main__ - INFO - XSLog2014.09.18-23.53.59.txt.gz - Analysing...
2014-10-07 13:51:53,197 - __main__ - INFO - XSLog2014.09.18-23.53.59.txt.gz - Finished analysing 4017126 bytes duration: 0:00:01.725641
2014-10-07 13:51:53,197 - __main__ - INFO - XSLog2014.09.30-11.45.44.txt.gz - Analysing...
2014-10-07 13:51:54,950 - __main__ - INFO - XSLog2014.09.30-11.45.44.txt.gz - Finished analysing 4970479 bytes duration: 0:00:01.753434
2014-10-07 13:51:54,950 - __main__ - INFO - XSLog2014.09.30-11.46.05.txt.gz - Analysing...
With gevent:

from gevent import monkey; monkey.patch_all()
import os
import re
import gzip
import gevent
import logging
from gevent import pool
from datetime import datetime


log_level = logging.INFO
logger = logging.getLogger(__name__)
logger.setLevel(log_level)
ch = logging.StreamHandler()
ch.setLevel(log_level)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)


def get_time_range(log):
    if not os.path.isfile(log):
        logger.error("\x1b[31m%s - Something went wrong analysing\x1b[0m" % log)
        return
    date_regex = re.compile(r'^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}:\d{3})')

    def process(lh):
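        # Scan every line, keeping the first and last timestamps seen.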
        start, end = str(), str()
        logger.info("\x1b[33m%s - Analysing...\x1b[0m" % os.path.basename(log))
        for line in lh:
            date = date_regex.match(line)
            if date:
                if not start:
                    start = date.group(1)
                end = date.group(1)
        return start, end
    start_time = datetime.now()
    size = os.path.getsize(log)
    if os.path.splitext(log)[1] == '.txt':
        with open(log, 'r') as lh:
            start, end = process(lh)
    elif os.path.splitext(log)[1] == '.gz':
        with gzip.open(log, 'r') as lh:
            start, end = process(lh)
    else:
        return
    meta = (log, size, start, end)
    duration = datetime.now() - start_time
    logger.info("\x1b[32m%s - Finished analysing %s bytes duration: %s\x1b[0m" % (os.path.basename(log), size, duration))
    return meta


def run(directory, pool_size=8, cur=None):
    start = datetime.now()
    worker_pool = gevent.pool.Pool(int(pool_size))
    while True:
        files = list()  # rebuild the file list on each pass
        for log in os.listdir(directory):
            if 'XSLog' in log and 'txt' in log:
                files.append(os.path.join(directory, log))
        logger.info("\x1b[36mFound %s files, duration: %s\x1b[0m" % (len(files), datetime.now() - start))
        logger.info("\x1b[36mNow analysing using %s threads.....\x1b[0m" % pool_size)
        for log in files:
            worker_pool.spawn(get_time_range, log)
        worker_pool.join()
        duration = datetime.now() - start
        logger.info("\x1b[36mFinished analysing - duration: %s\x1b[0m" % duration)


if __name__ == '__main__':
    run('/path/to/log/files')

With multiprocessing:

def run(directory, pool_size=8, cur=None):
    start = datetime.now()
    pool = Pool(processes=pool_size, maxtasksperchild=2)
    while True:
        files = list()  # rebuild the file list on each pass
        for log in os.listdir(directory):
            if 'XSLog' in log and 'txt' in log:
                files.append(os.path.join(directory, log))
        logger.info("\x1b[36mFound %s files, duration: %s\x1b[0m" % (len(files), datetime.now() - start))
        logger.info("\x1b[36mNow analysing using %s threads.....\x1b[0m" % pool_size)
        # pool.map_async(get_time_range, files)
        pool.map(get_time_range, files)  # This fixed it; map() blocks until every file is done.
        duration = datetime.now() - start
        logger.info("\x1b[36mFinished analysing - duration: %s\x1b[0m" % duration)

1 Answer:

Answer 0 (score: 1)

The amount of benefit you can get from parallelism here is limited, because a large fraction of your time is spent reading from disk. Disk I/O is serialized: no matter how many processes or greenlets you have, only one of them can read from the disk at a time. Apart from the time spent reading from disk, the rest is spent doing regular-expression matching on the lines being read. gevent will not help you with that at all. It is a CPU-bound operation, and gevent cannot be used to parallelize CPU-bound operations. gevent is useful for making blocking I/O operations non-blocking, which enables parallel I/O, but there is no blocking I/O going on here.
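
To see the difference concretely, here is a small self-contained experiment (busy() and the task counts are made up for the demonstration; monkey-patching is omitted because it only changes I/O calls, not CPU loops). The gevent pool runs the CPU-bound tasks one after another, while the multiprocessing pool overlaps them across cores:

import re
from datetime import datetime
from gevent.pool import Pool as GeventPool
from multiprocessing import Pool as ProcessPool


def busy(n):
    # CPU-bound stand-in for the regex matching in get_time_range().
    pattern = re.compile(r'^(\d{4}\.\d{2}\.\d{2})')
    for _ in xrange(n):
        pattern.match('2014.10.07 13:04:41:000 some log line')


if __name__ == '__main__':
    start = datetime.now()
    gpool = GeventPool(4)
    for _ in xrange(4):
        gpool.spawn(busy, 500000)
    gpool.join()
    print("gevent pool:          %s" % (datetime.now() - start))  # roughly 4x one task

    start = datetime.now()
    ppool = ProcessPool(processes=4)
    ppool.map(busy, [500000] * 4)
    ppool.close()
    ppool.join()
    print("multiprocessing pool: %s" % (datetime.now() - start))  # roughly 1x on 4 cores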

multiprocessing can make the regex operations run in parallel, so I would expect it to perform better than the gevent version. But in either case, you probably won't be much faster (if at all) than the sequential version, because so much of your time is spent reading the files from disk.
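
One way to act on that observation, sketched rather than taken from the poster's code (match_dates, get_time_range_parallel and chunk_size are illustrative names): keep the disk reads sequential in the parent process and hand only the CPU-bound regex matching to the pool, in batches of lines:

import re
from itertools import islice
from multiprocessing import Pool

date_regex = re.compile(r'^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}:\d{3})')


def match_dates(lines):
    # CPU-bound part, run in the workers: timestamps found in one chunk of lines.
    return [m.group(1) for m in (date_regex.match(line) for line in lines) if m]


def get_time_range_parallel(log, pool, chunk_size=100000):
    # The parent reads the file sequentially (the disk serializes reads anyway)
    # while the workers run the regex matching on the chunks in parallel.
    with open(log) as fh:
        chunks = iter(lambda: list(islice(fh, chunk_size)), [])
        dates = [d for chunk in pool.imap(match_dates, chunks) for d in chunk]
    if dates:
        return dates[0], dates[-1]
    return None, None


if __name__ == '__main__':
    pool = Pool(processes=8)
    print(get_time_range_parallel('XSLog2014.05.15-16.40.01.txt', pool))
    pool.close()
    pool.join()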