Python multiprocessing: worker memory consumption keeps growing

Date: 2017-06-02 15:42:37

Tags: python multiprocessing out-of-memory

I have a list of Excel 2010 files (xlsx) in which I need to search for a specific value. Since xlsx is a binary format, a plain text editor cannot be used, so for each file I do the following:

  1. Get the file name
  2. Open it with pandas
  3. Convert the DataFrame to a numpy array
  4. Check the array values (a minimal single-file sketch of steps 1-4 follows this list)
  5. This calls for multiprocessing, since it is not I/O bound: the pandas work and the array conversion take time. So I have set up a multiprocessing version of my script (see below):
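
For a single file, steps 1-4 boil down to roughly the following (a minimal sketch; fileContains is just an illustrative name, and pandas needs an Excel engine such as xlrd or openpyxl installed):

    import pandas as pd

    def fileContains(fname, searchString):
        """Sketch of steps 1-4: open one workbook, check its first sheets."""
        xfile = pd.ExcelFile(fname)                      # open the workbook with pandas
        try:
            for sheet in xfile.sheet_names[:3]:
                data = pd.read_excel(xfile, sheet)               # sheet -> DataFrame
                if searchString in data.values.astype(str):      # numpy array check
                    return True
            return False
        finally:
            xfile.close()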

    The problem is the memory consumption of each worker process. Although each xlsx file is only about 100 KB, each worker's memory keeps climbing up to 2 GB. I don't understand why the memory is not released before the next file is processed. This way I run out of memory before the list of files has been worked through.

    The problem does not seem to be the queue, but the pandas part.
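
One way to confirm this is to run the same pandas loop in a single process, without any queues, and print the process's resident memory after each file, for example with the third-party psutil package (a hedged sketch, reusing the same path pattern as in the script below):

    import glob
    import pandas as pd
    import psutil                                # third-party: pip install psutil

    proc = psutil.Process()                      # the current process
    for fname in glob.glob(r'c:\temp\**\serial.xlsx', recursive=True):
        xfile = pd.ExcelFile(fname)
        for sheet in xfile.sheet_names[:3]:
            data = pd.read_excel(xfile, sheet)
            data.values.astype(str)              # same conversion as in the worker
        xfile.close()
        print(fname, proc.memory_info().rss // (1024 * 1024), 'MiB resident')

If the resident size keeps growing here as well, the queues are not the culprit.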

    Here is my code. It can be tested with any xlsx files on your system.

    import pandas as pd
    import multiprocessing as mp
    import glob
    
    path = r'c:\temp'
    fileFilter = 'serial.xlsx'
    searchString = '804.486'
    
    
    def searchFile(tasks, results, searchString):
        """Iterates over files in tasks and searches in file for the
        occurence of 'searchString'.
    
        Args:
        -----
        tasks: queue of strings
            Files to look in
        results: queue of strings
            Files where the searchString was found
        searchString: str
            the string to be searched
        """
        # for files in the queue
        for task in iter(tasks.get, 'STOP'):
            # read the file structure into memory
            xfile = pd.ExcelFile(task)
            # iterate all sheets
            for sheet in xfile.sheet_names[:3]:
                # read the sheet
                data = pd.read_excel(xfile, sheet)
                # check if searchString is in numpy representation of dataframe
                if searchString in data.values.astype(str):
                    # put filename in results queue
                    results.put(task)
                    break
            xfile.close()
    
    if __name__ == "__main__":
        # get all files matching the filter that are in the root path
        print('gathering files')
        files = glob.glob(path + '\**\{}'.format(fileFilter), recursive=True)
    
        # setup of queues and variables
        n_proc = 2
        tasks = mp.Queue()
        results = mp.Queue()
    
        print('Start processing')
        # setup processes and start them
        procs = [mp.Process(target=searchFile,
                            args=(tasks, results, searchString))
                 for x in range(n_proc)]
        for p in procs:
            p.daemon = True
            p.start()
    
        # populate queue
        for file in files:
            tasks.put(file)
    
        for proc in procs:
            tasks.put('STOP')
    
        for p in procs:
            p.join()
    
        # print results
        for result in range(results.qsize()):
            print(results.get())
    
        print('Done')
    

1 Answer:

Answer 0 (score: 0)

The problem seems to be that the gc cannot collect the pandas frames, because you never leave the function context they live in. You can use multiprocessing.Pool.map, which does the queue handling for you. The worker function is called once per item, so the gc gets a chance to do its work. You can also use the maxtasksperchild Pool constructor argument to limit the number of items a worker processes.

import glob
import multiprocessing

import pandas as pd


def searchFile(task, searchString):
    xfile = pd.ExcelFile(task)
    ...  # same per-sheet search as in the question
    if found:
        return task


if __name__ == '__main__':
    files = glob.glob(path + '\**\{}'.format(fileFilter), recursive=True)  # path/fileFilter as in the question
    searchString = '804.486'

    pool = multiprocessing.Pool(2, maxtasksperchild=10)

    args = ((fname, searchString) for fname in files)
    # starmap unpacks each (fname, searchString) tuple into the two
    # parameters of searchFile; plain map would pass the tuple as one argument
    matchedFiles = list(filter(None, pool.starmap(searchFile, args)))
    pool.close()
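
With maxtasksperchild=10 each worker process is replaced after it has completed that many tasks, so whatever memory it accumulated is released back to the operating system when the old process exits. If you would rather keep plain pool.map instead of starmap, the constant searchString can also be bound with functools.partial (a small alternative sketch, assuming the same searchFile as above):

    import functools

    worker = functools.partial(searchFile, searchString=searchString)
    matchedFiles = list(filter(None, pool.map(worker, files)))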