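I am trying to read a Parquet file from AWS S3.
The same code works on my Windows machine.
Googling the error turned up nothing.
pandas should use fastparquet to build the DataFrame, and fastparquet is installed.
Code: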
import boto3
import pandas as pd

def get_parquet_from_s3(bucket_name, file_name):
    """
    :param bucket_name:
    :param file_name:
    :return:
    """
    df = pd.read_parquet('s3://{}/{}'.format(bucket_name, file_name))
    print(df.head())

get_parquet_from_s3('my_bucket_name','my_file_name')
Here is the exception I get:
/home/ubuntu/.local/lib/python3.6/site-packages/numba/errors.py:131: UserWarning: Insufficiently recent colorama version found. Numba requires colorama >= 0.3.9
  warnings.warn(msg)
Traceback (most recent call last):
  File "test_pd_read_parq.py", line 15, in <module>
    get_parquet_from_s3('my_bucket_name','my_file_name')
  File "test_pd_read_parq.py", line 12, in get_parquet_from_s3
    df = pd.read_parquet('s3://{}/{}'.format(bucket_name, file_name))
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 294, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 192, in read
    parquet_file = self.api.ParquetFile(path, open_with=s3.s3.open)
AttributeError: 'S3File' object has no attribute 's3'
Software and OS versions
python : 3.6
pandas : 0.25.0
s3fs : 0.3.1
ubuntu : 18.04
fastparquet : 0.3.1
boto3 : 1.9.198
botocore : 1.12.198
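Since the same code works on one machine and fails on another, it can help to print the installed versions on both and compare them. The following is only a minimal sketch: the function name installed_versions is hypothetical, and it assumes each package exposes a __version__ attribute, which is common but not guaranteed.

```python
import importlib
import platform


def installed_versions(package_names):
    """Map each package name to its __version__, or a placeholder string."""
    versions = {"python": platform.python_version()}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package cannot be imported in this environment.
            versions[name] = "not installed"
        else:
            # Not every package defines __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Calling installed_versions(["pandas", "s3fs", "fastparquet", "boto3", "botocore"]) on each machine and printing the items gives a list in the same shape as the one above, which makes differences easy to spot.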
Workaround
import s3fs
from fastparquet import ParquetFile

def get_parquet_from_s3(bucket_name, file_name):
    # Open the object through s3fs directly instead of going through pandas.
    s3 = s3fs.S3FileSystem()
    pf = ParquetFile('{}/{}'.format(bucket_name, file_name), open_with=s3.open)
    df = pf.to_pandas()
    return df
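If you want to keep calling pd.read_parquet where it works and only use the workaround where it fails, one option is a small fallback wrapper. This is a sketch under assumptions, not a recommended fix: the names read_parquet_with_fallback, read_via_pandas and read_via_fastparquet are hypothetical, and treating AttributeError as the signal to fall back is based only on the traceback above.

```python
def read_parquet_with_fallback(path, primary_reader, fallback_reader,
                               recoverable=(AttributeError,)):
    """Return primary_reader(path); on a recoverable error, use fallback_reader(path)."""
    try:
        return primary_reader(path)
    except recoverable:
        return fallback_reader(path)


def read_via_pandas(path):
    # Imported lazily so the wrapper itself has no hard dependency on pandas.
    import pandas as pd
    return pd.read_parquet(path)


def read_via_fastparquet(path):
    # Same calls as the workaround above.
    import s3fs
    from fastparquet import ParquetFile
    # Assumption: drop the scheme so the path has the 'bucket/key' form
    # that the workaround passes to ParquetFile.
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    s3 = s3fs.S3FileSystem()
    return ParquetFile(path, open_with=s3.open).to_pandas()
```

Usage would look like read_parquet_with_fallback('s3://my_bucket_name/my_file_name', read_via_pandas, read_via_fastparquet). Errors other than those listed in recoverable still propagate, so unrelated failures such as missing credentials are not hidden.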
Answer 0 (score: 1)
For Python 3.6+, AWS has a library called aws-data-wrangler that helps with integration between pandas, S3, and Parquet.
To install it:
pip install awswrangler
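To read Parquet from S3 (this is the 0.x API; in awswrangler 1.0 and later the equivalent call is wr.s3.read_parquet):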
import awswrangler as wr
df = wr.pandas.read_parquet(path="s3://my-bucket/my/path/to/parquet-file.parquet")
Answer 1 (score: 0)
You can read a Parquet file from S3 using s3fs and PyArrow, as shown below.
import s3fs
import pyarrow.parquet as pq
s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.ParquetDataset('s3://bucket/file.parquet', filesystem=s3).read_pandas().to_pandas()