Question

How can I read the metadata (column names with their types) of a Parquet file stored in IBM COS using Python?

The only way I have found:

           import pyarrow.parquet as pq
           import s3fs
           s3 = s3fs.S3FileSystem(anon=False, key='xxx', secret='xxx',
                   client_kwargs={'endpoint_url':
                                      "https://s3-api.us-geo.objectstorage.softlayer.net"})

           schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).read().schema

But it reads the whole file (I think).

Is there another way to get the metadata from a Parquet file located in IBM COS?

If I use

       schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).schema

it returns different data types. For strings: BYTE_ARRAY

and for timestamps: INT96

Strange...

Answer 1

Solved:

schema = pq.ParquetDataset(bucket, filesystem=s3).schema.to_arrow_schema()
