I have a rather large dask dataframe and want to save it for later use, in whatever format allows it to be read back quickly.
So far I have tried:
(1) Saving with .to_hdf. This produces the error:

HDF5ExtError: HDF5 error back trace [...] Can't set attribute 'non_index_axes' in node: /x (Group) ''.

which, according to some Google research, seems to be caused by the header growing beyond 64 KB (a small script that I think reproduces this follows the list).

(2) Saving with .to_hdf5, which produces:

TypeError: can't pickle thread.lock objects
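To make the header-size theory from (1) concrete, here is a sketch that I would expect to trigger the same HDF5ExtError with plain pandas. This rests on my assumption that the 'non_index_axes' attribute stores the (pickled) column labels, so a wide enough frame should push it past the 64 KB object-header limit; the 100000 columns are a made-up illustration, not my real data:

import numpy as np
import pandas as pd

# Assumption: 'non_index_axes' holds the column labels, so a very wide
# frame should exceed HDF5's 64 KB object-header message limit.
wide = pd.DataFrame(np.zeros((5, 100000)))
wide.to_hdf('wide.h5', '/x', format='table')  # expected: HDF5ExtError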
For the pickling error in (2), I found this minimal example:
import numpy as np
import dask.array

x = np.arange(1000)
da = dask.array.from_array(x, chunks=(100,))
da.to_hdf5('xyz.hdf5', '/x')  # raises: TypeError: can't pickle thread.lock objects
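In case it helps to narrow things down, the variant below sidesteps to_hdf5 by creating the target dataset with h5py and writing through dask.array.store. This is only a sketch of what I would try next (it assumes h5py is installed; I have not confirmed that it avoids the pickle error):

import numpy as np
import h5py
import dask.array

arr = dask.array.from_array(np.arange(1000), chunks=(100,))
with h5py.File('xyz_store.hdf5', 'w') as f:
    # pre-create the dataset, then let dask fill it chunk by chunk
    dset = f.create_dataset('/x', shape=arr.shape, dtype=arr.dtype)
    dask.array.store(arr, dset)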
What am I doing wrong? It is probably something obvious that I am missing, but for now I am stuck. If the dataframe can somehow be stored directly (keeping the index and column labels, as I tried in (1)), that would be my preferred solution. If that is not possible, any other solution that works with dask would be great, for example a different file format entirely (see the sketch below).
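For instance, if HDF5 turns out to be a dead end, I imagine something along these lines would also do. This is a sketch assuming a dask version with parquet support plus pyarrow or fastparquet installed; read_csv is only a stand-in for how my real dataframe is built:

import dask.dataframe as dd

df = dd.read_csv('traces-*.csv')         # stand-in for my real dataframe
df.to_parquet('traces.parquet')          # writes one file per partition
df2 = dd.read_parquet('traces.parquet')  # fast, parallel read-back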
Edit: the full error traceback from approach (1):
HDF5ExtError                              Traceback (most recent call last)
in <module>()
----> 1 voltage_traces.to_hdf('voltage_traces_hd5.hdf5', '/x')

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/dataframe/core.pyc in to_hdf(self, path_or_buf, key, mode, append, complevel, complib, fletcher32, get, **kwargs)
    540         from .io import to_hdf
    541         return to_hdf(self, path_or_buf, key, mode, append, complevel, complib,
--> 542                       fletcher32, get=get, **kwargs)
    543
    544     @derived_from(pd.DataFrame)

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/dataframe/io.pyc in to_hdf(df, path_or_buf, key, mode, append, complevel, complib, fletcher32, get, dask_kwargs, name_function, **kwargs)
    443
    444     DataFrame._get(merge(df.dask, dsk), (name, df.npartitions - 1),
--> 445                    get=get, **dask_kwargs)
    446
    447

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/base.pyc in _get(cls, dsk, keys, get, **kwargs)
     90         get = get or _globals['get'] or cls._default_get
     91         dsk2 = cls._optimize(dsk, keys, **kwargs)
---> 92         return get(dsk2, keys, **kwargs)
     93
     94     @classmethod

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in get_sync(dsk, keys, **kwargs)
    517     queue = Queue()
    518     return get_async(apply_sync, 1, dsk, keys, queue=queue,
--> 519                      raise_on_exception=True, **kwargs)
    520
    521

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    488                     f(key, res, dsk, state, worker_id)
    489             while state['ready'] and len(state['running']) < num_workers:
--> 490                 fire_task()
    491
    492     # Final reporting

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in fire_task()
    459         # Submit
    460         apply_async(execute_task, args=[key, dsk[key], data, queue,
--> 461                                         get_id, raise_on_exception])
    462
    463     # Seed initial tasks into the thread pool

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in apply_sync(func, args, kwds)
    509 def apply_sync(func, args=(), kwds={}):
    510     """A naive synchronous version of apply_async"""
--> 511     return func(*args, **kwds)
    512
    513

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in execute_task(key, task, data, queue, get_id, raise_on_exception)
    265     """
    266     try:
--> 267         result = _execute_task(task, data)
    268         id = get_id()
    269         result = key, result, None, id

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in _execute_task(arg, cache, dsk)
    246     elif istask(arg):
    247         func, args = arg[0], arg[1:]
--> 248         args2 = [_execute_task(a, cache) for a in args]
    249         return func(*args2)
    250     elif not ishashable(arg):

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in _execute_task(arg, cache, dsk)
    247         func, args = arg[0], arg[1:]
    248         args2 = [_execute_task(a, cache) for a in args]
--> 249         return func(*args2)
    250     elif not ishashable(arg):
    251         return arg

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in to_hdf(self, path_or_buf, key, **kwargs)
   1099
   1100         from pandas.io import pytables
-> 1101         return pytables.to_hdf(path_or_buf, key, self, **kwargs)
   1102
   1103     def to_msgpack(self, path_or_buf=None, encoding='utf-8', **kwargs):

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, **kwargs)
    258         with HDFStore(path_or_buf, mode=mode, complevel=complevel,
    259                       complib=complib) as store:
--> 260             f(store)
    261     else:
    262         f(path_or_buf)

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in <lambda>(store)
    253         f = lambda store: store.append(key, value, **kwargs)
    254     else:
--> 255         f = lambda store: store.put(key, value, **kwargs)
    256
    257     if isinstance(path_or_buf, string_types):

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in put(self, key, value, format, append, **kwargs)
    824             format = get_option("io.hdf.default_format") or 'fixed'
    825         kwargs = self._validate_format(format, kwargs)
--> 826         self._write_to_group(key, value, append=append, **kwargs)
    827
    828     def remove(self, key, where=None, start=None, stop=None):

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1262
   1263         # write the object
-> 1264         s.write(obj=value, append=append, complib=complib, **kwargs)
   1265
   1266         if s.is_table and index:

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
   3799
   3800         # set the table attributes
-> 3801         self.set_attrs()
   3802
   3803         # create the table

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in set_attrs(self)
   3050         self.attrs.index_cols = self.index_cols()
   3051         self.attrs.values_cols = self.values_cols()
-> 3052         self.attrs.non_index_axes = self.non_index_axes
   3053         self.attrs.data_columns = self.data_columns
   3054         self.attrs.nan_rep = self.nan_rep

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/tables/attributeset.pyc in __setattr__(self, name, value)
    459
    460         # Set the attribute.
--> 461         self._g__setattr(name, value)
    462
    463         # Log new attribute addition.

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/tables/attributeset.pyc in _g__setattr(self, name, value)
    401             value = stvalue[()]
    402
--> 403         self._g_setattr(self._v_node, name, stvalue)
    404
    405         # New attribute or value. Introduce it into the local

tables/hdf5extension.pyx in tables.hdf5extension.AttributeSet._g_setattr (tables/hdf5extension.c:7917)()

HDF5ExtError: HDF5 error back trace

  File "H5A.c", line 259, in H5Acreate2
    unable to create attribute
  File "H5Aint.c", line 275, in H5A_create
    unable to create attribute in object header
  File "H5Oattribute.c", line 347, in H5O_attr_create
    unable to create new attribute in header
  File "H5Omessage.c", line 224, in H5O_msg_append_real
    unable to create new message
  File "H5Omessage.c", line 1945, in H5O_msg_alloc
    unable to allocate space for message
  File "H5Oalloc.c", line 1142, in H5O_alloc
    object header message is too large

End of HDF5 error back trace

Can't set attribute 'non_index_axes' in node: /x (Group) ''.