Memory usage steadily growing with multiprocessing.Pool.imap_unordered

Time: 2016-12-01 23:55:54

Tags: python multithreading multiprocessing

I just noticed that my program is using more and more memory as it processes a large file. It only processes one line at a time, though, so I couldn't figure out why it keeps using more memory.

After a lot of digging, I realized that the program has three steps:

  1. Load the data, one line at a time.
  2. Process each line with imap_unordered() on a multiprocessing.Pool.
  3. Process each result in a single thread.

If steps 1 and 2 are faster than step 3, then the results from the pool workers queue up, consuming memory.

How can I throttle the data that I feed to the pool in step 2 so that it doesn't get ahead of the consumer in step 3?

This is similar to another multiprocessing question, but it's not clear to me where the delay is in that one.

Here's a small example that demonstrates the problem:

    import logging
    import os
    import multiprocessing
    from time import sleep
    
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s:%(process)d:%(thread)d:%(message)s')
    logger = logging.getLogger()
    
    def process_step1():
        data = 'a' * 100000
        for i in xrange(10000):
            sleep(.001)  # Faster than step 3.
            yield data
            if i % 1000 == 0:
                logger.info('Producing %d.', i)
        logger.info('Finished producing.')
    
    
    def process_step2(data):
        return data.upper()
    
    
    def process_step3(up_data):
        assert up_data == 'A' * 100000
        sleep(.005)  # Slower than step 1.
    
    
    def main():
        pool = multiprocessing.Pool(processes=10)
        logger.info('Starting.')
        loader = process_step1()
        processed = pool.imap_unordered(process_step2, loader)
        for i, up_data in enumerate(processed):
            process_step3(up_data)
            if i % 500 == 0:
                logger.info('Consuming %d, using %0.1f MB.', i, get_memory())
        logger.info('Done.')
    
    
    def get_memory():
        """ Look up the memory usage, return in MB. """
        proc_file = '/proc/{}/status'.format(os.getpid())
        scales = {'KB': 1024.0, 'MB': 1024.0 * 1024.0}
        with open(proc_file, 'rU') as f:
            for line in f:
                if 'VmSize:' in line:
                    fields = line.split()
                    size = int(fields[1])
                    scale = fields[2].upper()
                    return size*scales[scale]/scales['MB']
        return 0.0  # Unknown
    
    main()
    

When it runs, I can see the memory usage steadily climbing until step 1 finishes. If I let it keep running long enough after that, the memory usage starts to drop.

    2016-12-01 15:37:50,859:6414:139712380557056:Starting.
    2016-12-01 15:37:50,861:6414:139712266237696:Producing 0.
    2016-12-01 15:37:50,868:6414:139712380557056:Consuming 0, using 255.0 MB.
    2016-12-01 15:37:52,054:6414:139712266237696:Producing 1000.
    2016-12-01 15:37:53,244:6414:139712266237696:Producing 2000.
    2016-12-01 15:37:53,421:6414:139712380557056:Consuming 500, using 383.0 MB.
    2016-12-01 15:37:54,446:6414:139712266237696:Producing 3000.
    2016-12-01 15:37:55,635:6414:139712266237696:Producing 4000.
    2016-12-01 15:37:55,976:6414:139712380557056:Consuming 1000, using 511.2 MB.
    2016-12-01 15:37:56,831:6414:139712266237696:Producing 5000.
    2016-12-01 15:37:58,019:6414:139712266237696:Producing 6000.
    2016-12-01 15:37:58,529:6414:139712380557056:Consuming 1500, using 703.2 MB.
    2016-12-01 15:37:59,209:6414:139712266237696:Producing 7000.
    2016-12-01 15:38:00,406:6414:139712266237696:Producing 8000.
    2016-12-01 15:38:01,084:6414:139712380557056:Consuming 2000, using 831.5 MB.
    2016-12-01 15:38:01,602:6414:139712266237696:Producing 9000.
    2016-12-01 15:38:02,802:6414:139712266237696:Finished producing.
    2016-12-01 15:38:03,640:6414:139712380557056:Consuming 2500, using 959.5 MB.
    2016-12-01 15:38:06,199:6414:139712380557056:Consuming 3000, using 959.5 MB.
    

1 answer:

Answer 0 (score: 6)

It seems that Pool.imap_unordered() starts a new thread to iterate over the input sequence that step 1 generates, so we need to throttle that thread from the main thread that is running step 3. The Semaphore class is designed for one thread to throttle another: we call acquire() before yielding each line, and release() when each line is consumed. If we start the semaphore at an arbitrary value like 100, it builds up a buffer of 100 lines before blocking and waiting for the consumer to catch up.

import logging
import os
import multiprocessing
from threading import Semaphore
from time import sleep

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s:%(process)d:%(thread)d:%(message)s')
logger = logging.getLogger()

def process_step1(semaphore):
    data = 'a' * 100000
    for i in xrange(10000):
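        # Block here once 100 lines are in flight, until the consumer releases a slot.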
        semaphore.acquire()
        sleep(.001)  # Faster than step 3.
        yield data
        if i % 1000 == 0:
            logger.info('Producing %d.', i)
    logger.info('Finished producing.')


def process_step2(data):
    return data.upper()


def process_step3(up_data, semaphore):
    assert up_data == 'A' * 100000
    sleep(.005)  # Slower than step 1.
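    # One line has been consumed; free a slot so the producer can yield another.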
    semaphore.release()


def main():
    pool = multiprocessing.Pool(processes=10)
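    # Let the producer run at most 100 lines ahead of the consumer.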
    semaphore = Semaphore(100)
    logger.info('Starting.')
    loader = process_step1(semaphore)
    processed = pool.imap_unordered(process_step2, loader)
    for i, up_data in enumerate(processed):
        process_step3(up_data, semaphore)
        if i % 500 == 0:
            logger.info('Consuming %d, using %0.1f MB.', i, get_memory())
    logger.info('Done.')


def get_memory():
    """ Look up the memory usage, return in MB. """
    proc_file = '/proc/{}/status'.format(os.getpid())
    scales = {'KB': 1024.0, 'MB': 1024.0 * 1024.0}
    with open(proc_file, 'rU') as f:
        for line in f:
            if 'VmSize:' in line:
                fields = line.split()
                size = int(fields[1])
                scale = fields[2].upper()
                return size*scales[scale]/scales['MB']
    return 0.0  # Unknown

main()

Now the memory usage stays steady, because the producer doesn't get far ahead of the consumer.

2016-12-01 15:52:13,833:6695:140124578850560:Starting.
2016-12-01 15:52:13,835:6695:140124535109376:Producing 0.
2016-12-01 15:52:13,841:6695:140124578850560:Consuming 0, using 255.0 MB.
2016-12-01 15:52:16,424:6695:140124578850560:Consuming 500, using 255.0 MB.
2016-12-01 15:52:18,498:6695:140124535109376:Producing 1000.
2016-12-01 15:52:19,015:6695:140124578850560:Consuming 1000, using 255.0 MB.
2016-12-01 15:52:21,602:6695:140124578850560:Consuming 1500, using 255.0 MB.
2016-12-01 15:52:23,675:6695:140124535109376:Producing 2000.
2016-12-01 15:52:24,192:6695:140124578850560:Consuming 2000, using 255.0 MB.
2016-12-01 15:52:26,776:6695:140124578850560:Consuming 2500, using 255.0 MB.
2016-12-01 15:52:28,846:6695:140124535109376:Producing 3000.
2016-12-01 15:52:29,362:6695:140124578850560:Consuming 3000, using 255.0 MB.
2016-12-01 15:52:31,951:6695:140124578850560:Consuming 3500, using 255.0 MB.
2016-12-01 15:52:34,022:6695:140124535109376:Producing 4000.
2016-12-01 15:52:34,538:6695:140124578850560:Consuming 4000, using 255.0 MB.
2016-12-01 15:52:37,128:6695:140124578850560:Consuming 4500, using 255.0 MB.
2016-12-01 15:52:39,193:6695:140124535109376:Producing 5000.
2016-12-01 15:52:39,704:6695:140124578850560:Consuming 5000, using 255.0 MB.
2016-12-01 15:52:42,291:6695:140124578850560:Consuming 5500, using 255.0 MB.
2016-12-01 15:52:44,361:6695:140124535109376:Producing 6000.
2016-12-01 15:52:44,878:6695:140124578850560:Consuming 6000, using 255.0 MB.
2016-12-01 15:52:47,465:6695:140124578850560:Consuming 6500, using 255.0 MB.
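
If the same throttling is needed in more than one place, the acquire()/release() pair can be factored into a small generator wrapper. The following is a minimal sketch of that refactoring, not part of the original answer; the helper name throttled, the load/work stand-ins, and the limit of 100 are illustrative choices:

    import multiprocessing
    from threading import Semaphore
    
    
    def throttled(iterable, semaphore):
        """Yield items, blocking once too many results are waiting to be consumed."""
        for item in iterable:
            semaphore.acquire()  # wait until the consumer has released a slot
            yield item
    
    
    def load():
        # Stand-in for step 1: produce lines of data.
        for i in xrange(10000):
            yield 'a' * 100000
    
    
    def work(data):
        # Stand-in for step 2: runs in the pool workers.
        return data.upper()
    
    
    def main():
        pool = multiprocessing.Pool(processes=10)
        semaphore = Semaphore(100)  # at most 100 lines in flight
        for up_data in pool.imap_unordered(work, throttled(load(), semaphore)):
            # ... step 3: consume up_data here ...
            semaphore.release()  # one line consumed; let the producer continue
    
    
    main()

This behaves like the answer's code above: the pool's internal task-feeding thread blocks inside throttled() whenever 100 results are outstanding, so memory stays bounded.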