Python multiprocessing memory allocation issue

Asked: 2019-01-04 17:21:22

Tags: python parallel-processing python-multiprocessing

I'm using the multiprocessing module. I'm reading tables from a MySQL database, and doing so in parallel chunks. Everything works fine until I increase the size of each data frame.

The gist: I loop over each table name (tbl_name) in a list and read the table in multiple parallel chunks. I store the intermediate chunks in a dictionary, and at the end I combine all of the data frames in that dictionary with pd.concat.

Like I said, the problem is a memory allocation error once I hit roughly 1,000,000 rows, so I have 3 questions:

1) Where can I do some memory cleanup (e.g., del objects, gc.collect())? In other words, where are in-memory copies of my data being made, and when is it safe to release them? I've tried putting del statements in a few places, but it yells at me that those objects don't exist (because they only live and die inside the parallel loop).
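
For example, the only spot I can think of is right after the pd.concat in the main loop, something like this (just a guess on my part; results is the module-level dict that collect_results() fills, and I'm not sure this placement actually releases anything):

# hypothetical cleanup placement, right after combining one table's chunks
df_dict['df_' + tbl_name] = pd.concat(results.values())
results.clear()   # drop references to the intermediate chunk data frames
gc.collect()      # ask the garbage collector to release them now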

2) Since I'm spinning up a Pool of resources and calling the same function on multiple tables, should I wait until the last loop iteration to call Pool.close()? It looks like I'm opening and closing the pool unnecessarily and causing a memory leak.
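
To make this concrete, this is roughly the structure I'm wondering about instead, with one Pool opened before the table loop and closed after it (illustrative only; the worker count of 8 is a placeholder and the __name__ guard is omitted for brevity):

# illustrative structure only: open the Pool once, close it after the last table
p = Pool(processes=8)          # fixed worker count instead of chunk_nbr per table
for tbl_name in tbl_list:
    # ... build chunk_dict and submit this table's chunk jobs, then wait for them ...
    pass
p.close()                      # opened once, closed once
p.join()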

3) I have a callback function that takes each intermediate data frame and stores it in a dictionary, and then combines all of those data frames at the end, once the parallel execution is done. Should I avoid this and instead just return the raw output and combine everything at the end? If so, how do I pd.concat a list of lists into a data frame?
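
In other words, something like this instead of the callback, where I keep the AsyncResult handles, pull out the raw data frames, and combine them in one go (again just a sketch, using the same names as the code below):

# sketch: collect the return values directly instead of using a callback
jobs = [p.apply_async(create_sql_chunk, args=(tbl_name, i))
        for i in range(1, chunk_nbr + 1)]
chunk_dfs = [j.get() for j in jobs]               # one data frame per chunk
df_dict['df_' + tbl_name] = pd.concat(chunk_dfs)  # combine into a single frame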

Since it's sensitive, I've had to hide some of the code, but the gist is that sql_to_dataframe() sends an ODBC query to the MySQL database and returns a data frame in memory.

Full code:

import halosql
import pandas as pd
import datetime
import numpy as np

import time
import gc
from multiprocessing import Process, Queue, Pool

# parameters
limit_nbr = 500000
chunk_threshold = 100000
halo_open = datetime.date(2018, 2, 12)
extract_date = datetime.date(2018, 2, 13)
halo_close = datetime.date(9999, 12, 31)
ecf_id = 'ceedbb4e-c180-4ff1-9c77-d59fb65873c9'
ts_rowid_range = 20180213143418000000000001
TB_tbl_name = 'GLT0'
GL_lines_tbl_name = 'BSEG'

cm = halosql.Connections(cm_ipaddress, cm_dbname, cm_username, cm_userpass)

run_time = time.time() # PKB

def create_sql_chunk(tbl_name, chunk_nbr):
    # chunk_dict is read as a module-level global; each worker fetches its own uid_row range
    df = cm.sql_to_dataframe(
        'SELECT * FROM {tbl_name} WHERE uid_row {query_filter}'.format(
            tbl_name=tbl_name, query_filter=chunk_dict[chunk_nbr]))
    # time.sleep(chunk_nbr*2)
    return df


results = {}
def collect_results(result):
    """apply_async callback: stores each returned data frame in the shared results dict, keyed by timestamp"""
    results[time.time()] = result


tbl_list = ['BSEG', 'SKAT', 'TCURX', 'SKA1', 'T009', 'T001', 'TBSLT', 'T003T', 'T003', 'TJ01T', 'TSTCT', 'TTYPT', 'USR02',
            'T881', 'BKPF', 'GLT0', 'ADRP', 'USR21', 'T004T', 'GLT0'] # load data

time_dict = {}
df_dict = {}

for tbl_name in tbl_list:
    read_time = time.time()  # PKB
    print('Reading tbl ' + tbl_name + ' from MemSQL')
    cm.execute('USE ' + cm_dbname + ';')
    uidrow_guide = cm.sql_to_dataframe('SELECT min(uid_row) as row_start, max(uid_row) as row_end FROM {tbl_name}'.format(tbl_name=tbl_name,limit_nbr=limit_nbr))
    total_row_size = int(str(uidrow_guide.row_end[0])[:-3]) if limit_nbr is None or int(str(uidrow_guide.row_end[0])[:-3]) < limit_nbr else limit_nbr
    row_size = total_row_size
    chunk_nbr = int(total_row_size / chunk_threshold)
    chunk_nbr = 1  if chunk_nbr == 0 else chunk_nbr
    chunk_size = int(row_size / chunk_nbr)
    if (chunk_size * chunk_nbr) < row_size:
        chunk_nbr += 1
    chunk_dict = {}
    row_start = 1
    row_end = row_start + chunk_size - 1
    uid_row_start = str(row_start) + '001'
    uid_row_end = str(row_end) + '001'
    for c in list(range(1,chunk_nbr+1)):
        tbl_read_time = time.time()
        if c < chunk_nbr:
            chunk_dict[c] = 'BETWEEN ' + uid_row_start + ' AND ' + uid_row_end
        else:
            chunk_dict[c] = 'BETWEEN ' + uid_row_start + ' AND ' + str(total_row_size) + '001'

        row_start += chunk_size
        row_end = row_start + chunk_size - 1
        uid_row_start = str(row_start) + '001'
        uid_row_end = str(row_end) + '001'

    if __name__ == "__main__":
        start_time = time.time()

        # Read this table's chunks concurrently; a new Pool is created for every table
        p = Pool(processes=chunk_nbr)
        for i in list(range(1,chunk_nbr+1)):
            p.apply_async(create_sql_chunk, args=(tbl_name,i,), callback=collect_results)
        print("--- %s seconds ---" % (time.time() - start_time))
        p.close()
        p.join()

        df_dict['df_' + tbl_name] = pd.concat(results.values())
        gc.collect()

    read_time = (time.time() - read_time) / 60
    time_dict['READ df_' + tbl_name] = read_time

print(df_dict)
print(time_dict)
run_time = (time.time() - run_time) / 60
print("--- %s minutes to finish ---" % run_time) # PKB

0 Answers:

There are no answers yet.