I have a large dask dataframe and would like to save it for later use, in any format that allows fast read access.

So far I have tried:

  • Saving it via .to_hdf. This raises the error HDF5ExtError: HDF5 error back trace [...] Can't set attribute 'non_index_axes' in node: /x (Group) ''., which, according to some googling, seems to be caused by headers larger than 64 KB (a sketch that reproduces this follows below the list).

  • Converting it to a pandas dataframe and saving that. This eats all of my RAM and only works in fixed format, not in table format (if I try table format, I get the same error as above). But in fixed format the data can only be read back as one monolithic block, which is not really what I want.

  • Converting the dask dataframe to a dask array, as described in Dask Array from DataFrame, and saving it via .to_hdf5. This produces TypeError: can't pickle thread.lock objects.

  • Saving the dask array created above in bcolz format, using the commands given at http://dask.pydata.org/en/latest/array-creation.html#other-on-disk-storage. This fails with the same error: TypeError: can't pickle thread.lock objects.

  • Saving the dask dataframe to castra. This does not work either, and given its "experimental" status castra would not be my preferred solution anyway.
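To make the header-size suspicion from (1) concrete, here is a minimal sketch with made-up data. The assumption (mine, from googling, not verified) is that pandas pickles the full column list into the 'non_index_axes' attribute of the HDF5 node, so a frame with enough columns should hit the 64 KB header limit:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical wide frame standing in for my real data; thousands of
# column names should inflate the node header past 64 KB.
pdf = pd.DataFrame(np.random.random((100, 5000)),
                   columns=['long_column_name_%04d' % i for i in range(5000)])
ddf = dd.from_pandas(pdf, npartitions=4)
ddf.to_hdf('test.hdf5', '/x')  # expected to raise the HDF5ExtError from (1)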

For the pickling error I found this minimal example:

import numpy as np
import dask, dask.array

x = np.arange(1000)
da = dask.array.from_array(x, chunks=(100,))
da.to_hdf5('xyz.hdf5', '/x')
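In case it helps, the manual route through h5py, which as far as I understand is roughly what to_hdf5 does internally, would look like this (untested sketch; dask.array.store and h5py's create_dataset are standard API, but I don't know whether this sidesteps the thread.lock pickling):

import h5py
import numpy as np
import dask.array

x = np.arange(1000)
da = dask.array.from_array(x, chunks=(100,))

# Create the target dataset up front and let dask fill it chunk by chunk.
with h5py.File('xyz.hdf5', 'w') as f:
    dset = f.create_dataset('/x', shape=da.shape, dtype=da.dtype)
    dask.array.store(da, dset)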

What am I doing wrong? It must be something obvious that I am missing, but for now I am stuck. If it is somehow possible to store the dataframe directly (keeping the index and column identifiers, as I tried in (1)), that would be my preferred solution. If that is impossible, any other solution that works with dask would be great.
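One more variant of (1) that I am aware of, in case it matters for answers: dask can write one HDF5 file per partition when the path contains a '*'. I have not verified whether this avoids the header problem, since presumably each file still stores the full column list (sketch only):

voltage_traces.to_hdf('voltage_traces_*.hdf5', '/x')

# Reading back with the same glob pattern:
import dask.dataframe as dd
df = dd.read_hdf('voltage_traces_*.hdf5', '/x')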

Edit: full error stack from approach (1):

HDF5ExtError                              Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 voltage_traces.to_hdf('voltage_traces_hd5.hdf5', '/x')

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/dataframe/core.pyc in to_hdf(self, path_or_buf, key, mode, append, complevel, complib, fletcher32, get, **kwargs)
    540         from .io import to_hdf
    541         return to_hdf(self, path_or_buf, key, mode, append, complevel, complib,
--> 542                       fletcher32, get=get, **kwargs)
    543
    544     @derived_from(pd.DataFrame)

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/dataframe/io.pyc in to_hdf(df, path_or_buf, key, mode, append, complevel, complib, fletcher32, get, dask_kwargs, name_function, **kwargs)
    443
    444     DataFrame._get(merge(df.dask, dsk), (name, df.npartitions - 1),
--> 445                    get=get, **dask_kwargs)
    446
    447

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/base.pyc in _get(cls, dsk, keys, get, **kwargs)
     90         get = get or _globals['get'] or cls._default_get
     91         dsk2 = cls._optimize(dsk, keys, **kwargs)
---> 92         return get(dsk2, keys, **kwargs)
     93
     94     @classmethod

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in get_sync(dsk, keys, **kwargs)
    517     queue = Queue()
    518     return get_async(apply_sync, 1, dsk, keys, queue=queue,
--> 519                      raise_on_exception=True, **kwargs)
    520
    521

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    488                 f(key, res, dsk, state, worker_id)
    489         while state['ready'] and len(state['running']) < num_workers:
--> 490             fire_task()
    491
    492     # Final reporting

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in fire_task()
    459         # Submit
    460         apply_async(execute_task, args=[key, dsk[key], data, queue,
--> 461                                         get_id, raise_on_exception])
    462
    463     # Seed initial tasks into the thread pool

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in apply_sync(func, args, kwds)
    509 def apply_sync(func, args=(), kwds={}):
    510     """A naive synchronous version of apply_async"""
--> 511     return func(*args, **kwds)
    512
    513

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in execute_task(key, task, data, queue, get_id, raise_on_exception)
    265     """
    266     try:
--> 267         result = _execute_task(task, data)
    268         id = get_id()
    269         result = key, result, None, id

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in _execute_task(arg, cache, dsk)
    246     elif istask(arg):
    247         func, args = arg[0], arg[1:]
--> 248         args2 = [_execute_task(a, cache) for a in args]
    249         return func(*args2)
    250     elif not ishashable(arg):

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/dask/async.pyc in _execute_task(arg, cache, dsk)
    247         func, args = arg[0], arg[1:]
    248         args2 = [_execute_task(a, cache) for a in args]
--> 249         return func(*args2)
    250     elif not ishashable(arg):
    251         return arg

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in to_hdf(self, path_or_buf, key, **kwargs)
   1099
   1100         from pandas.io import pytables
-> 1101         return pytables.to_hdf(path_or_buf, key, self, **kwargs)
   1102
   1103     def to_msgpack(self, path_or_buf=None, encoding='utf-8', **kwargs):

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, **kwargs)
    258         with HDFStore(path_or_buf, mode=mode, complevel=complevel,
    259                       complib=complib) as store:
--> 260             f(store)
    261     else:
    262         f(path_or_buf)

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in <lambda>(store)
    253         f = lambda store: store.append(key, value, **kwargs)
    254     else:
--> 255         f = lambda store: store.put(key, value, **kwargs)
    256
    257     if isinstance(path_or_buf, string_types):

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in put(self, key, value, format, append, **kwargs)
    824             format = get_option("io.hdf.default_format") or 'fixed'
    825         kwargs = self._validate_format(format, kwargs)
--> 826         self._write_to_group(key, value, append=append, **kwargs)
    827
    828     def remove(self, key, where=None, start=None, stop=None):

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1262
   1263         # write the object
-> 1264         s.write(obj=value, append=append, complib=complib, **kwargs)
   1265
   1266         if s.is_table and index:

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
   3799
   3800         # set the table attributes
-> 3801         self.set_attrs()
   3802
   3803         # create the table

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in set_attrs(self)
   3050         self.attrs.index_cols = self.index_cols()
   3051         self.attrs.values_cols = self.values_cols()
-> 3052         self.attrs.non_index_axes = self.non_index_axes
   3053         self.attrs.data_columns = self.data_columns
   3054         self.attrs.nan_rep = self.nan_rep

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/tables/attributeset.pyc in __setattr__(self, name, value)
    459
    460         # Set the attribute.
--> 461         self._g__setattr(name, value)
    462
    463         # Log new attribute addition.

/nas1/Data_arco/prgr/Anaconda/lib/python2.7/site-packages/tables/attributeset.pyc in _g__setattr(self, name, value)
    401                 value = stvalue[()]
    402
--> 403         self._g_setattr(self._v_node, name, stvalue)
    404
    405         # New attribute or value. Introduce it into the local

tables/hdf5extension.pyx in tables.hdf5extension.AttributeSet._g_setattr (tables/hdf5extension.c:7917)()

HDF5ExtError: HDF5 error back trace

  File "H5A.c", line 259, in H5Acreate2
    unable to create attribute
  File "H5Aint.c", line 275, in H5A_create
    unable to create attribute in object header
  File "H5Oattribute.c", line 347, in H5O_attr_create
    unable to create new attribute in header
  File "H5Omessage.c", line 224, in H5O_msg_append_real
    unable to create new message
  File "H5Omessage.c", line 1945, in H5O_msg_alloc
    unable to allocate space for message
  File "H5Oalloc.c", line 1142, in H5O_alloc
    object header message is too large

End of HDF5 error back trace

Can't set attribute 'non_index_axes' in node: /x (Group) ''.