
Can the python fastparquet module read SNAPPY-compressed Parquet files?


Our Parquet files are stored in an AWS S3 bucket and are compressed with SNAPPY. I can read the uncompressed version of the Parquet file with the python fastparquet module, but not the compressed version.

This is the code I use for the uncompressed file:

import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
myopen = s3.open
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.parquet', open_with=myopen)
df = pf.to_pandas()

This returns without error, but when I try to read the SNAPPY-compressed version of the file:

pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=myopen)

I get an error from to_pandas():

df = pf.to_pandas()

Error message:

KeyError                                  Traceback (most recent call last)
----> 1 df = pf.to_pandas()

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    293                          for (name, v) in views.items()}
    294                 self.read_row_group(rg, columns, categories, infile=f,
--> 295                                     index=index, assign=parts)
    296                 start = rg.num_rows
    297         else:

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign)
    151         core.read_row_group(
    152                 infile, rg, columns, categories, self.helper, self.cats,
--> 153                 self.selfmade, index=index, assign=assign)
    154         if ret:
    155             return df

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign)
    300         raise RuntimeError('Going with pre-allocation!')
    301     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 302                           cats, selfmade, assign=assign)
    303
    304     for cat in cats:

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign)
    289         read_col(column, schema_helper, file, use_cat=use,
    290                  selfmade=selfmade, assign=out[name],
--> 291                  catdef=out[name+'-catdef'] if use else None)
    292
    293

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef)
    196     dic = None
    197     if ph.type == parquet_thrift.PageType.DICTIONARY_PAGE:
--> 198         dic = np.array(read_dictionary_page(infile, schema_helper, ph, cmd))
    199         ph = read_thrift(infile, parquet_thrift.PageHeader)
    200         dic = convert(dic, se)

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_dictionary_page(file_obj, schema_helper, page_header, column_metadata)
    152     Consumes data using the plain encoding and returns an array of values.
    153     """
--> 154     raw_bytes = _read_page(file_obj, page_header, column_metadata)
    155     if column_metadata.type == parquet_thrift.Type.BYTE_ARRAY:
    156         # no faster way to read variable-length strings?

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in _read_page(file_obj, page_header, column_metadata)
     28     """Read the data page from the given file-object and convert it to raw, uncompressed bytes (if necessary)."""
     29     raw_bytes = file_obj.read(page_header.compressed_page_size)
---> 30     raw_bytes = decompress_data(raw_bytes, column_metadata.codec)
     31
     32     assert len(raw_bytes) == page_header.uncompressed_page_size, \

/opt/conda/lib/python3.5/site-packages/fastparquet/compression.py in decompress_data(data, algorithm)
     48 def decompress_data(data, algorithm='gzip'):
     49     if isinstance(algorithm, int):
---> 50         algorithm = rev_map[algorithm]
     51     if algorithm.upper() not in decompressions:
     52         raise RuntimeError("Decompression '%s' not available.  Options: %s" %

KeyError: 1
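The trace ends with KeyError: 1 inside decompress_data, so fastparquet apparently has no decompressor registered for Parquet codec id 1 (SNAPPY). A quick diagnostic sketch, relying on the fastparquet internals visible in the traceback above (the decompressions mapping is internal and may differ in other versions):

# Sketch: list the codecs this fastparquet install can decompress and
# check whether the python-snappy package is importable at all.
from fastparquet import compression

print("decompressions registered:", sorted(compression.decompressions))

try:
    import snappy  # provided by the python-snappy package
    print("python-snappy round-trip ok:", snappy.decompress(snappy.compress(b"test")) == b"test")
except ImportError:
    print("python-snappy is not installed")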

1 Answer


    The error probably indicates that a library for decompressing SNAPPY could not be found on your system - although clearly the error message could be more helpful!

    Depending on your system, one of the following lines may fix the problem for you:

    conda install python-snappy
    

    or

    pip install python-snappy
    

    If you are on Windows, the build chain may not work, and you may need to install from here.
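
    Once a SNAPPY library is installed, the original call should work unchanged; a minimal sketch reusing the path, credentials, and open_with from the question:

    import s3fs
    from fastparquet import ParquetFile

    s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
    # same file as before, just the .snappy.parquet version
    pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=s3.open)
    df = pf.to_pandas()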
