
Can the python fastparquet module read SNAPPY-compressed Parquet files?


Our Parquet files are stored in an AWS S3 bucket and are compressed with SNAPPY. I can read the uncompressed version of the Parquet file with the python fastparquet module, but not the compressed version.

This is the code I use for the uncompressed file:

import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
myopen = s3.open
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.parquet', open_with=myopen)
df = pf.to_pandas()

This returns without error, but when I try to read the SNAPPY-compressed version of the file:

pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=myopen)

I get an error from to_pandas():

df = pf.to_pandas()

Error message:

KeyError                                  Traceback (most recent call last)
----> 1 df = pf.to_pandas()

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    293                          for (name, v) in views.items()}
    294                 self.read_row_group(rg, columns, categories, infile=f,
--> 295                                     index=index, assign=parts)
    296                 start = rg.num_rows
    297         else:

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign)
    151         core.read_row_group(
    152                 infile, rg, columns, categories, self.helper, self.cats,
--> 153                 self.selfmade, index=index, assign=assign)
    154         if ret:
    155             return df

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign)
    300         raise RuntimeError('Going with pre-allocation!')
    301     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 302                           cats, selfmade, assign=assign)
    303
    304     for cat in cats:

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign)
    289         read_col(column, schema_helper, file, use_cat=use,
    290                  selfmade=selfmade, assign=out[name],
--> 291                  catdef=out[name+'-catdef'] if use else None)
    292
    293

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef)
    196     dic = None
    197     if ph.type == parquet_thrift.PageType.DICTIONARY_PAGE:
--> 198         dic = np.array(read_dictionary_page(infile, schema_helper, ph, cmd))
    199         ph = read_thrift(infile, parquet_thrift.PageHeader)
    200         dic = convert(dic, se)

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_dictionary_page(file_obj, schema_helper, page_header, column_metadata)
    152     Consumes data using the plain encoding and returns an array of values.
    153     """
--> 154     raw_bytes = _read_page(file_obj, page_header, column_metadata)
    155     if column_metadata.type == parquet_thrift.Type.BYTE_ARRAY:
    156         # no faster way to read variable-length strings?

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in _read_page(file_obj, page_header, column_metadata)
     28     """Read the data page from the given file-object and convert it to raw, uncompressed bytes (if necessary)."""
     29     raw_bytes = file_obj.read(page_header.compressed_page_size)
---> 30     raw_bytes = decompress_data(raw_bytes, column_metadata.codec)
     31
     32     assert len(raw_bytes) == page_header.uncompressed_page_size, \

/opt/conda/lib/python3.5/site-packages/fastparquet/compression.py in decompress_data(data, algorithm)
     48 def decompress_data(data, algorithm='gzip'):
     49     if isinstance(algorithm, int):
---> 50         algorithm = rev_map[algorithm]
     51     if algorithm.upper() not in decompressions:
     52         raise RuntimeError("Decompression '%s' not available.  Options: %s" %

KeyError: 1
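The trace ends with KeyError: 1 inside decompress_data, so fastparquet apparently has no decompressor registered for Parquet codec id 1 (SNAPPY). A quick diagnostic sketch, relying on the fastparquet internals visible in the traceback above (the decompressions mapping is internal and may differ in other versions):

# Sketch: list the codecs this fastparquet install can decompress and
# check whether the python-snappy package is importable at all.
from fastparquet import compression

print("decompressions registered:", sorted(compression.decompressions))

try:
    import snappy  # provided by the python-snappy package
    print("python-snappy round-trip ok:", snappy.decompress(snappy.compress(b"test")) == b"test")
except ImportError:
    print("python-snappy is not installed")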

1 Answer


    The error probably indicates that a library for decompressing SNAPPY could not be found on your system - although clearly the error message could be more helpful!

    Depending on your system, one of the following lines may fix the problem for you:

    conda install python-snappy
    

    or

    pip install python-snappy
    

    If you are on Windows, the build chain may not work, and you may need to install from here.
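
    Once a SNAPPY library is installed, the original call should work unchanged; a minimal sketch reusing the path, credentials, and open_with from the question:

    import s3fs
    from fastparquet import ParquetFile

    s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
    # same file as before, just the .snappy.parquet version
    pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=s3.open)
    df = pf.to_pandas()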
