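I have a large data frame that I want to write out to different files, depending on the value in a particular column.
The first function takes a dictionary whose keys are the files to write and whose values are NumPy arrays, each a subset of the original data frame.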
def write_in_parallel(inputDict):
    for key, value in inputDict.items():
        df = pd.DataFrame(value)
        with open(baseDir + outDir + outputFileName + key + outputFileType, 'a') as oFile:
            # write the frame built above (the name `data` is not defined in this function)
            df.to_csv(oFile, sep='|', index=False, header=False)
        print("Finished writing month: " + outputFileName + key)
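The second function takes the column values used to partition the data frame, along with the data frame itself, and returns a dictionary that maps each value to its slice.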
def make_slices(files, df):
    outlist = dict()
    for item in files:
        data = np.array(df[df.iloc[:, 1] == item])
        outlist[item] = data
    return outlist
The final function uses multiprocessing to call write_in_parallel on the dictionary returned by make_slices, hoping to iterate over it in parallel.
def make_dynamic_columns():
    perfPath = baseDir + rawDir
    perfFiles = glob.glob(perfPath + "/*" + inputFileType)
    perfFrame = pd.DataFrame()
    for file_ in perfFiles:
        df = pd.read_table(file_, delimiter='|', header=None)
        df.fillna(missingDataChar, inplace=True)
        df.iloc[:, 1] = df.iloc[:, 1].astype(str)
        fileList = list(df.iloc[:, 1].astype('str').unique())
        with mp.Pool(processes=10) as pool:
            pool.map(write_in_parallel, make_slices(fileList, df))
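The error I get is "'str' object has no attribute 'items'", which leads me to believe that pool.map and write_in_parallel are not receiving a dictionary. I don't know how to fix this. Any help is greatly appreciated.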
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "_FHLMC_LLP_dataprep.py", line 22, in write_in_parallel
    for key,value in dict.items():
AttributeError: 'str' object has no attribute 'items'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "_FHLMC_LLP_dataprep.py", line 59, in <module>
    make_dynamic_columns_freddie()
  File "_FHLMC_LLP_dataprep.py", line 55, in make_dynamic_columns_freddie
    pool.map(write_in_parallel, dictinput)
  File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
AttributeError: 'str' object has no attribute 'items'
Answer 0 (score: 1)
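Your problem is that make_slices returns a dictionary rather than a list, and pool.map() does not handle that the way you expect. pool.map() iterates over its argument, and iterating over a dictionary yields only its keys, so each worker receives a single key, which is a string (try printing what you receive as inputDict). It is not the dictionary, just a key.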
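The behavior described above can be reproduced without multiprocessing. The following is a minimal sketch in which `slices` is a hypothetical stand-in for the dictionary returned by make_slices, and the built-in map() stands in for pool.map(), which consumes its iterable in the same way:

```python
# Hypothetical stand-in for the dictionary returned by make_slices.
slices = {"201601": [1, 2], "201602": [3]}

# Iterating over a dict yields its keys, so each call receives a str.
received = list(map(type, slices))

# Iterating over .items() yields (key, value) tuples instead.
pairs = list(slices.items())
```

Here `received` contains only `str` types, which is why calling `.items()` on the worker's argument fails.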
def make_slices(files, df):
    outlist = []
    for item in files:
        data = df + item  # placeholder data, used only to test the idea
        outlist.append({item: data})
    return outlist
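You could try something like the modified make_slices above, so that you actually return a list. Its members are then single-entry dictionaries. (I had to modify your code and simply put something into data in order to test it.)
That way each worker receives both the key and its associated data, if that is what you want.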
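As a variation on the same idea, the dictionary can be kept as it is and its items passed to pool.map(), so that each worker receives a (key, rows) tuple. This is only a sketch under assumptions: `write_slice`, `write_all`, `out_dir` and the `.txt` naming scheme are hypothetical, and the standard csv module is used in place of DataFrame.to_csv to keep the example dependency-free. The rows can be any iterable of rows, such as the NumPy array built in make_slices.

```python
import csv
import multiprocessing as mp
import os
from functools import partial


def write_slice(item, out_dir):
    """Append one (key, rows) pair to its own pipe-delimited file."""
    key, rows = item  # pool.map passes one list element per call
    path = os.path.join(out_dir, key + ".txt")  # hypothetical naming scheme
    with open(path, "a", newline="") as out_file:
        csv.writer(out_file, delimiter="|").writerows(rows)
    return path


def write_all(slices, out_dir, processes=2):
    """Write every slice of a {key: rows} dict, one task per key."""
    worker = partial(write_slice, out_dir=out_dir)
    with mp.Pool(processes=processes) as pool:
        # list(slices.items()) hands each worker a (key, rows) tuple,
        # not just the key.
        return pool.map(worker, list(slices.items()))
```

In the question's code this would be called as something like write_all(make_slices(fileList, df), out_dir), ideally under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.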