Multiprocessing problem when writing to CSV in Python

Time: 2019-01-05 15:10:01

Tags: python multithreading multiprocessing

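I created a function that takes a date as an argument and writes the output it produces to a CSV file. If I run it in a multiprocessing pool with, for example, 28 processes over a list of 100 dates, the last 72 rows of the output CSV file are twice as long as they should be (each of those rows is just the expected row joined with a repetition of itself).

My code: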

import numpy as np
import pandas as pd
import multiprocessing

#Load the data
df = pd.read_csv('data.csv', low_memory=False)
list_s = df.date.unique()
def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output

    return sample

# list_s is the list of dates I want to run funk on

def mp_handler():
    # 28 is the number of worker processes I want to run
    p = multiprocessing.Pool(28)
    for result in p.imap(funk, list_s[0:100]):
        result.to_csv('crsp_full.csv', mode='a')


if __name__=='__main__':
    mp_handler()

The output looks like this:

date,port_ret_1,port_ret_2
2010-03-05,0.0,0.002
date,port_ret_1,port_ret_2
2010-02-12,-0.001727,0.009139189315

...
# and after the first 28 rows it looks like this
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-03,0.002045,0.00045092025,0.002045,0.00045092025
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-15,-0.006055,-0.00188451972,-0.006055,-0.00188451972

I tried putting a lock() inside funk(), but it produced the same result and only took longer to run. Any ideas on how to fix this?
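One observation about the code as posted: in mp_handler() only the parent process calls to_csv, so the writes already happen one after another, and a lock inside funk() would not be expected to change the file. The header line repeated before every row comes from to_csv writing a header on each call. Below is a minimal sketch of writing the header only once. It assumes a hypothetical stand-in funk() that returns a one-row DataFrame indexed by date; write_results and the extra parameters of mp_handler are illustrative names, not part of the original code.

```python
import multiprocessing

import pandas as pd


def funk(date):
    # Hypothetical stand-in for the real funk(): one row per date.
    return pd.DataFrame(
        {"port_ret_1": [0.0], "port_ret_2": [0.002]},
        index=pd.Index([date], name="date"),
    )


def write_results(results, path):
    """Append each DataFrame to path, writing the header only for the first."""
    first = True
    for result in results:
        result.to_csv(path, mode="a", header=first)
        first = False


def mp_handler(dates, path, processes=4):
    # imap yields results in input order; only the parent process writes,
    # so no lock is needed around the file.
    with multiprocessing.Pool(processes) as pool:
        write_results(pool.imap(funk, dates), path)
```

Calling mp_handler(list_s[0:100], 'crsp_full.csv', processes=28) from under the `if __name__ == '__main__':` guard would correspond to the setup in the question. This only removes the repeated headers; it does not by itself explain the doubled columns.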

Edit: funk looks like this. e is the same as date.

def funk(e):
    block = pd.DataFrame()
    i = s_list.index(e)
    if i > 19:
        ran = s_list[i-19:i+6]
        ran0 = s_list[i-19:i+1]
        # print ran0
        piv = df.pivot(index='date', columns='permno', values='date')
        # Drop the stocks which do not have returns for the given time window and make the list of suitable stocks
        s = list(piv.loc[ran].dropna(axis=1).columns)
        sample = df[df['permno'].isin(s)]
        sample = sample.loc[ran]
        permno = ['10001', '93422']
        sample = sample[sample['permno'].isin(permno)]
        # print sample.index.unique()
        # get past 20 days returns in additional 20 columns
        for i in range(0, 20):
            sample['r_{}'.format(i)] = sample.groupby('permno')['ret'].shift(i)
        #merge dataset with betas
        sample = pd.merge(sample, betas_aug, left_index=True, right_index=True)
        sample['ex_ret'] = 0

        # calculate expected return
        for i in range(0,20):
            sample['ex_ret'] += sample['ma_beta_{}'.format(i)]*sample['r_{}'.format(i)]
        # print(sample)
        # define a stock into two legs based on expected return
        sample['sign'] = sample['ex_ret'].apply(lambda x: -1 if x<0 else 1)
        # workaround for short leg, multiply returns by -1
        sample['abs_ex_ret'] = sample['ex_ret']*sample['sign']
        # create 5 columns for future realised 5 days returns (multiplied by -1 for short leg)
        for i in range(1,6):
            sample['rp_{}'.format(i)] = sample.groupby(['permno'])['ret'].shift(-i)
            sample['rp_{}'.format(i)] = sample['rp_{}'.format(i)]*sample['sign']
        sample = sample.reset_index(drop=True)
        sample['w_0'] = sample['abs_ex_ret'].div(sample.groupby(['date'])['abs_ex_ret'].transform('sum'))
        for i in range(1, 5):
            sample['w_{}'.format(i)] = sample['w_{}'.format(i-1)]*(1+sample['rp_{}'.format(i)])
        sample = sample.dropna(how='any')
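        # drop the helper columns; drop() returns a new frame, so assign it back
        sample = sample.drop(columns=['ma_beta_{}'.format(k) for k in range(0, 20)]
                                     + ['r_{}'.format(k) for k in range(0, 20)])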
        for k in range(1, 6):
            sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)]*sample['rp_{}'.format(k)]
            q = ['port_ret_{}'.format(k)]
            list_names.extend(q)
        block = sample.groupby('date')[list_names].sum().copy()
    return block
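A possible cause worth checking: list_names is not created inside funk() in the code shown, so it is presumably a module-level list. If that is the case, list_names.extend(q) makes it grow on every call within the same worker process, and from the second call onward the column selection would contain each port_ret_ name more than once. That would be consistent with the first 28 rows (one first call per worker) looking right and the later rows having repeated columns. The sketch below reproduces only that accumulation pattern with plain lists; funk_shared and funk_local are hypothetical names used for illustration.

```python
list_names = []  # assumption: a module-level list, as the code above implies


def funk_shared(n_cols=2):
    # Mirrors the pattern in funk(): the shared list is extended on every call.
    for k in range(1, n_cols + 1):
        q = ['port_ret_{}'.format(k)]
        list_names.extend(q)
    return list(list_names)


def funk_local(n_cols=2):
    # Builds the names fresh on each call, so nothing accumulates.
    return ['port_ret_{}'.format(k) for k in range(1, n_cols + 1)]


first = funk_shared()   # names after the first call in a worker
second = funk_shared()  # names after the second call in the same worker
print(first)
print(second)
print(funk_local())
```

If this is what is happening, building the list of column names inside funk() (as in funk_local) instead of extending a shared list would be one way to avoid it.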

0 answers:

No answers yet.