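What I want to do is convert all the files in S3 (AWS storage) to Parquet format and then save them back to S3.
I can't figure out how to convert all of the files in the bucket; the code below handles only a single file. Please help!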
import boto3
import pandas as pd
import pyarrow as pa
from s3fs import S3FileSystem
import pyarrow.parquet as pq
s3 = boto3.client('s3',region_name='us-east-2')
obj = s3.get_object(Bucket='dstest-s3', Key='dstest/movies.csv')
df = pd.read_csv(obj['Body'])
table = pa.Table.from_pandas(df)
output_file = "s3://dstest-s3/dstest/parquetconversion1.parquet"
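fs = S3FileSystem()  # separate name so the boto3 client above is not shadowed
# Note: write_to_dataset treats root_path as a directory and writes a part file inside it
pq.write_to_dataset(table=table, root_path=output_file, filesystem=fs)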
print("File converted from CSV to parquet completed")
Answer 0: (score: 0)
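Basically, you need to use list_objects_v2 to get all the keys in the bucket, loop over them, and for each one download, convert, and upload. A single list_objects_v2 call returns at most 1,000 keys, so larger buckets need pagination.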
import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(
    Bucket='dstest-s3',
    Prefix='dstest/'
)
# 'Contents' is missing from the response when no key matches the prefix
for s3_obj in response.get('Contents', []):
    obj = s3.get_object(Bucket='dstest-s3', Key=s3_obj['Key'])
    # Do your converting, and uploading here
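To fill in the "converting and uploading" step, here is a minimal sketch rather than a tested solution. It assumes every file you want to convert has a key ending in .csv, that each file fits in memory, and that pandas and pyarrow are installed. The names parquet_key_for and convert_prefix are hypothetical helpers made up for this example, and writing each Parquet file next to its CSV is just one possible layout.

```python
import io
import posixpath


def parquet_key_for(csv_key):
    """Hypothetical helper: 'dstest/movies.csv' -> 'dstest/movies.parquet'."""
    return posixpath.splitext(csv_key)[0] + ".parquet"


def convert_prefix(s3, bucket, prefix):
    """Convert every .csv object under `prefix` to Parquet in the same bucket.

    `s3` is a boto3 S3 client. Returns the list of keys that were written.
    """
    # Imported here so the key helper above can be used without pandas installed.
    import pandas as pd

    written = []
    # The paginator follows continuation tokens, so buckets with more than
    # 1,000 keys are handled too.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for item in page.get("Contents", []):
            key = item["Key"]
            if not key.lower().endswith(".csv"):
                continue  # skip folders, existing Parquet files, etc.

            # Download the CSV into memory and parse it.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            df = pd.read_csv(io.BytesIO(body))

            # Serialize to Parquet in memory, then upload the bytes.
            buf = io.BytesIO()
            df.to_parquet(buf, engine="pyarrow", index=False)
            new_key = parquet_key_for(key)
            s3.put_object(Bucket=bucket, Key=new_key, Body=buf.getvalue())
            written.append(new_key)
    return written
```

With the bucket from the question you would call it as convert_prefix(boto3.client('s3', region_name='us-east-2'), 'dstest-s3', 'dstest/'). For files too large to hold in memory you would need a different approach, such as reading the CSV in chunks.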