Writing a dictionary to multiple CSV files in parallel

Time: 2017-12-08 16:29:49

Tags: python dictionary parallel-processing

I have a large DataFrame that I want to write out to different files, depending on the value in a particular column.

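The first function takes a dictionary whose keys are the files to write to and whose values are NumPy arrays, each one a subset of the original DataFrame.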

def write_in_parallel(inputDict):
    for key,value in inputDict.items():
        df = pd.DataFrame(value)
        with open(baseDir + outDir + outputFileName + key + outputFileType, 'a') as oFile:
            df.to_csv(oFile, sep = '|', index = False, header = False)
        print("Finished writing month: " + outputFileName + key)

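The second function takes the column values used to partition the DataFrame, along with the DataFrame itself, and returns a dictionary of the resulting slices.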

def make_slices(files, df):
    outlist = dict()
    for item in files:
        data = np.array(df[df.iloc[:,1] == item])
        outlist[item] = data
    return outlist

The final function uses multiprocessing to call write_in_parallel and iterate over the dictionary from make_slices, hopefully in parallel.

def make_dynamic_columns():
    perfPath = baseDir + rawDir
    perfFiles = glob.glob(perfPath + "/*" + inputFileType)
    perfFrame = pd.DataFrame()
    for file_ in perfFiles:
        df = pd.read_table(file_, delimiter = '|', header = None)

        df.fillna(missingDataChar,inplace=True)
        df.iloc[:,1] = df.iloc[:,1].astype(str)

        fileList = list(df.iloc[:, 1].astype('str').unique())

        with mp.Pool(processes=10) as pool:
            pool.map(write_in_parallel, make_slices(fileList, df))

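The error I get is "'str' object has no attribute 'items'", which leads me to believe that pool.map is not handing a dictionary to write_in_parallel. I don't know how to fix this. Any help is much appreciated.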

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "_FHLMC_LLP_dataprep.py", line 22, in write_in_parallel
    for key,value in dict.items():
AttributeError: 'str' object has no attribute 'items'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "_FHLMC_LLP_dataprep.py", line 59, in <module>
    make_dynamic_columns_freddie()
  File "_FHLMC_LLP_dataprep.py", line 55, in make_dynamic_columns_freddie
    pool.map(write_in_parallel, dictinput)
  File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
AttributeError: 'str' object has no attribute 'items'

1 answer:

Answer 0 (score: 1)

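Your problem is that make_slices returns a dictionary rather than a list, and pool.map() does not treat that the way you expect. pool.map() iterates over its argument, and iterating over a dictionary yields only its keys, so each worker receives a single key, which is a string (try printing what arrives as inputDict). What the worker gets is not the dictionary, just one of its keys.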
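To see what each worker call would receive, here is a minimal sketch that uses the built-in map as a stand-in for pool.map, since both iterate over their argument in the same way. The sample dictionary and the describe helper are hypothetical and exist only for illustration.

```python
# Hypothetical sample data: keys are file suffixes, values are row lists.
slices = {
    "2017-01": [[1, "2017-01"]],
    "2017-02": [[2, "2017-02"]],
}


def describe(arg):
    """Return the type name of whatever a single worker call receives."""
    return type(arg).__name__


# Iterating a dict yields its keys, so every call gets a str.
received_from_dict = list(map(describe, slices))

# Iterating .items() yields (key, value) tuples.
received_from_items = list(map(describe, slices.items()))

# A list of one-entry dicts hands each call a real dict.
received_from_list = list(map(describe, [{k: v} for k, v in slices.items()]))

print(received_from_dict)   # ['str', 'str']
print(received_from_items)  # ['tuple', 'tuple']
print(received_from_list)   # ['dict', 'dict']
```

Only the third form matches a worker that calls .items() on its argument, which is why the original write_in_parallel failed with the first.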

def make_slices(files, df):
    outlist = []
    for item in files:
        data = df + item
        outlist.append({item: data})
    return outlist

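Could you try something like this, so that you actually return a list? Each member of the list is then a one-entry dictionary. (I had to modify your code to put something into data for testing.)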

This way each worker receives both the key and its associated data, if that is what you want.
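As a rough end-to-end sketch of the same idea, the work items can also be plain tuples, one per output file. Everything named here is an assumption for illustration: write_slice, make_tasks, the "out_" file prefix and the sample rows are hypothetical, and the standard-library csv module stands in for DataFrame.to_csv so the example has no third-party dependencies.

```python
import csv
import multiprocessing as mp
import os
import tempfile


def write_slice(task):
    """Append one slice to its own pipe-delimited file and return the path."""
    out_dir, key, rows = task  # one work item per output file
    path = os.path.join(out_dir, "out_" + key + ".txt")
    with open(path, "a", newline="") as handle:
        csv.writer(handle, delimiter="|").writerows(rows)
    return path


def make_tasks(out_dir, slices):
    """Turn {key: rows} into a list of (out_dir, key, rows) tuples."""
    return [(out_dir, key, rows) for key, rows in slices.items()]


if __name__ == "__main__":
    # Hypothetical sample data standing in for the partitioned DataFrame.
    slices = {
        "201701": [[1, "201701", 0.5], [2, "201701", 0.7]],
        "201702": [[3, "201702", 0.9]],
    }
    with tempfile.TemporaryDirectory() as out_dir:
        with mp.Pool(processes=2) as pool:
            # Each worker call receives one complete (out_dir, key, rows) tuple.
            written = pool.map(write_slice, make_tasks(out_dir, slices))
        print(sorted(os.path.basename(p) for p in written))
```

Because every key maps to a different file, no two workers append to the same file at once. The worker is defined at module level so that the pool can pickle it, and the pool itself is created under the `__main__` guard.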