Preprocessing data with a multiprocessing pool

Time: 2018-10-27 02:43:37

Tags: python pandas multiprocessing python-multiprocessing

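I am preprocessing a large dataset (10 million rows) with pandas DataFrames. Instead of fetching everything at once, I fetch the data in batches and preprocess each batch with multiprocessing. For testing, the cursor holds 10,000 rows and I fetch 1,000 rows at a time: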

import math
import time
from multiprocessing import Pool

import pandas as pd

for i in range(loop_num): # loop_num=10
    begin_time = time.time()
    rows = cursor.fetchmany(1000)  # fetch 1000 rows at a time
    end_time1 = time.time()
    print("cost %.2f seconds fetching data for No%d time" % (end_time1-begin_time, i))

    # data preprocessing
    print("preprocessing data for one time...")
    p = Pool(5)  # use multiprocessing
    results = []
    tasks_num = 10
    rows_num = int(math.ceil(1000 / tasks_num)) # rows in each subprocess

    for j in range(tasks_num):
        row = rows[(j * rows_num):((j + 1) * rows_num)]
        print("get %d rows" % len(row))
        # save preprocessed data in results
        results.append(p.apply_async(onetime_clean, args=(table_cols, row, predict_month), error_callback=bar))

    print('Waiting for all subprocesses done...')
    p.close()
    p.join()
    # combine the per-task results into one dataframe
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    df_nps_sca = pd.concat([r.get() for r in results], ignore_index=True, sort=False)
    print(df_nps_sca.shape)
    end_time_per = time.time()
    print("cost %.2f seconds NO%d time" % (end_time_per-begin_time, i))
# end for
print("data preprocessing finished!")
conn.close()
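
One detail in the loop above may be worth checking: the chunk size is computed from a hard-coded 1000, so if fetchmany returns fewer rows (for example on the last batch), some slices come out empty and onetime_clean receives no rows. Below is a minimal sketch of deriving the chunks from the actual batch length instead. chunk_rows is a hypothetical helper, not part of the original code.

```python
import math


def chunk_rows(rows, n_tasks):
    """Split rows into at most n_tasks non-empty, contiguous chunks."""
    if not rows:
        return []
    # Chunk size is based on the real number of rows, not a fixed 1000.
    size = math.ceil(len(rows) / n_tasks)
    return [rows[start:start + size] for start in range(0, len(rows), size)]


# Stand-in for a short batch returned by cursor.fetchmany(1000).
batch = list(range(950))
chunks = chunk_rows(batch, 10)
print([len(c) for c in chunks])
```

Each chunk could then be passed to p.apply_async in place of the fixed slices, so no task is submitted with an empty list.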

Here is my onetime_clean function:

def onetime_clean(table_cols, rows, predict_month):
    num_columns = ['AGE', 'DURATION', 'ENTERTAINMENT', 'SOCIALITY', 'LIFE']
    scene_columns = ['SCENIC_SCENE', 'TRAFIC_SCENE','OFFICE_SCENE','PUBLIC_SCENE']

    # convert the rows into a dataframe
    df_nps = pd.DataFrame(rows, columns=table_cols)
    # call another function that only does some transformations
    df_nps = data_transfer(df_nps)

    # load the imputer and scaler fitted on the training data
    sc_x = joblib.load('./sc_X.pkl')
    imputer = joblib.load('./imputer.pkl')

    # fill missing values with the imputer
    df_nps[num_columns] = imputer.transform(df_nps[num_columns])

    # scale every column except the first and convert the result back into a dataframe
    df_nps_sca = sc_x.transform(df_nps.iloc[:, 1:])  # the error is raised on this line
    df_nps_sca = pd.DataFrame(df_nps_sca, index=df_nps.index, columns=sca_cols)
    return df_nps_sca

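When I run it, some of the subprocesses (not all) fail with the error below, while the rest succeed:

failed: operands could not be broadcast together with shapes (100,62) (63,) (100,62)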

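It looks like some variable is being modified by several subprocesses at once, but I can't figure out why. Any help would be appreciated.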
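
For what it's worth, the workers in a Pool are separate processes and do not share variables, so the shapes in the message may point somewhere else: the scaler appears to expect 63 columns while the failing chunks have only 62. One possible cause, assuming data_transfer one-hot encodes categorical columns with something like pd.get_dummies (the question does not show it), is that a 100-row chunk missing one category yields one column fewer. The sketch below illustrates that effect and one way to pin the layout. The SCENE column, TRAIN_COLS and encode_chunk are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical column layout saved at training time (not from the question).
TRAIN_COLS = ["AGE", "SCENE_office", "SCENE_public", "SCENE_traffic"]


def encode_chunk(df, train_cols):
    """One-hot encode a chunk, then force the training-time column layout."""
    encoded = pd.get_dummies(df, columns=["SCENE"])
    # Dummy columns missing from this chunk are added as 0;
    # columns not seen during training are dropped.
    return encoded.reindex(columns=train_cols, fill_value=0)


chunk_a = pd.DataFrame({"AGE": [30, 41, 25],
                        "SCENE": ["office", "public", "traffic"]})
# This chunk happens to contain only one category.
chunk_b = pd.DataFrame({"AGE": [52, 19], "SCENE": ["office", "office"]})

# Encoding each chunk on its own gives different column counts.
raw_a = pd.get_dummies(chunk_a, columns=["SCENE"])
raw_b = pd.get_dummies(chunk_b, columns=["SCENE"])
print(raw_a.shape, raw_b.shape)

# Reindexing against the fixed list keeps the width constant.
fixed_a = encode_chunk(chunk_a, TRAIN_COLS)
fixed_b = encode_chunk(chunk_b, TRAIN_COLS)
print(fixed_a.shape, fixed_b.shape)
```

Printing df_nps.shape inside onetime_clean just before the sc_x.transform call would confirm or rule out this guess.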

0 answers
