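Can the Python fastparquet module read compressed Parquet files?

Asked: 2017-02-14 19:48:33

Tags: python pandas parquet

Our Parquet files are stored in an AWS S3 bucket and are compressed with SNAPPY. I am able to read the uncompressed version of a Parquet file with the Python fastparquet module, but not the compressed version.

This is the code I use for the uncompressed file: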

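import s3fs
from fastparquet import ParquetFile

# Open files through s3fs so fastparquet can read directly from the bucket
s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
myopen = s3.open
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.parquet', open_with=myopen)
df = pf.to_pandas()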

This returns without error, but when I try to read the snappy-compressed version of the file:

pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=myopen)

I get an error from to_pandas():

df = pf.to_pandas()

The error message:

  

KeyError                                  Traceback (most recent call last)
 in ()
----> 1 df = pf.to_pandas()

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    293                          for (item, v) in views.items()}
    294                 self.read_row_group(rg, columns, categories, infile=f,
--> 295                                     index=index, assign=parts)
    296                 start += rg.num_rows
    297         else:

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign)
    151         core.read_row_group(
    152                 infile, rg, columns, categories, self.helper, self.cats,
--> 153                 self.selfmade, index=index, assign=assign)
    154         if ret:
    155             return df

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign)
    300         raise RuntimeError('Going with pre-allocation!')
    301     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 302                           cats, selfmade, assign=assign)
    303
    304     for cat in cats:

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign)
    289             read_col(column, schema_helper, file, use_cat=use,
    290                      selfmade=selfmade, assign=out[name],
--> 291                      catdef=out[name+'-catdef'] if use else None)
    292
    293

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef)
    196     dic = None
    197     if ph.type == parquet_thrift.PageType.DICTIONARY_PAGE:
--> 198         dic = np.array(read_dictionary_page(infile, schema_helper, ph, cmd))
    199         ph = read_thrift(infile, parquet_thrift.PageHeader)
    200         dic = convert(dic, se)

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_dictionary_page(file_obj, schema_helper, page_header, column_metadata)
    152     Consumes data using the plain encoding and returns an array of values.
    153     """
--> 154     raw_bytes = _read_page(file_obj, page_header, column_metadata)
    155     if column_metadata.type == parquet_thrift.Type.BYTE_ARRAY:
    156         # no faster way to read variable-length-strings?

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in _read_page(file_obj, page_header, column_metadata)
     28     """Read the data page from the given file-object and convert it to raw, uncompressed bytes (if necessary)."""
     29     raw_bytes = file_obj.read(page_header.compressed_page_size)
---> 30     raw_bytes = decompress_data(raw_bytes, column_metadata.codec)
     31
     32     assert len(raw_bytes) == page_header.uncompressed_page_size, \

/opt/conda/lib/python3.5/site-packages/fastparquet/compression.py in decompress_data(data, algorithm)
     48 def decompress_data(data, algorithm='gzip'):
     49     if isinstance(algorithm, int):
---> 50         algorithm = rev_map[algorithm]
     51     if algorithm.upper() not in decompressions:
     52         raise RuntimeError("Decompression '%s' not available.  Options: %s" %

KeyError: 1

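1 Answer:

Answer 0: (score: 14)

The error probably means that the library for decompressing SNAPPY was not found on your system, although the error message could clearly be more helpful!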
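To illustrate how a missing codec library can surface as a bare `KeyError: 1`, here is a simplified model of a reverse lookup table that only contains the codecs whose libraries were importable. This is a sketch, not fastparquet's actual source: the names `PARQUET_CODEC_IDS`, `build_rev_map` and `describe_codec` are hypothetical. The numeric ids follow the `CompressionCodec` enum of the Parquet format, where SNAPPY is 1.

```python
# Simplified model only; these names are hypothetical, not fastparquet's API.
# Codec ids as assigned by the Parquet format's CompressionCodec enum.
PARQUET_CODEC_IDS = {"UNCOMPRESSED": 0, "SNAPPY": 1, "GZIP": 2}


def build_rev_map(available):
    """Map codec id -> codec name, only for codecs that are available."""
    return {PARQUET_CODEC_IDS[name]: name for name in available}


def describe_codec(codec_id, rev_map):
    """Look up a codec name, raising a more descriptive error when missing."""
    try:
        return rev_map[codec_id]
    except KeyError:
        raise RuntimeError(
            "No decompressor registered for codec id %d; available: %s"
            % (codec_id, sorted(rev_map.values()))
        ) from None


if __name__ == "__main__":
    # Pretend the snappy bindings failed to import, so SNAPPY is absent.
    rev_map = build_rev_map(["UNCOMPRESSED", "GZIP"])
    try:
        rev_map[1]  # the file metadata says codec 1 (SNAPPY)
    except KeyError as exc:
        print("KeyError:", exc)
```

In this model, a plain dictionary lookup on the codec id fails before any friendlier "not available" check can run, which would match the traceback ending at `rev_map[algorithm]`.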

Depending on your system, installing the python-snappy package may solve this for you, using one of the following:

conda install python-snappy

pip install python-snappy

If you are on Windows, the build chain may not work, and you may need to install from here.
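A quick way to confirm whether the snappy bindings are visible to the interpreter that runs fastparquet is to check for the module directly. This is a minimal sketch, assuming the bindings are exposed under the module name `snappy` (as the python-snappy package does); the helper `snappy_available` is a hypothetical name introduced for illustration.

```python
import importlib.util


def snappy_available(module_name="snappy"):
    """Return True if the named module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if snappy_available():
        print("snappy bindings found")
    else:
        print("snappy bindings missing; SNAPPY-compressed files will likely fail")
```

If the check fails inside a notebook but succeeds in a shell, the two are probably using different environments. After installing the package, restarting the Python session is likely needed so that fastparquet picks up the codec on import.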