Question

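I would like to upload a Pandas DataFrame directly to S3 by specifying an S3 URL. I have a multi-profile AWS environment, and I would like to specify the name of the profile to use for this upload.

Since it is not possible to specify the region in the S3 URL, I wonder whether there is any other way to specify the (non-default) region in code.

I could not find any such option in s3fs, the library that Pandas uses internally to upload to S3.

    import pandas as pd

    data = [1, 2, 3]
    df = pd.DataFrame({'value': data})
    # I would like to specify non-default profile to use here
    s3_url = 's3://my_bucket/path/to/file.parquet'
    df.to_parquet(s3_url)

Note that I do not want to use environment variables or modify the default configuration in the AWS credentials file.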

Answer 1

Use a session

    import boto3

    session = boto3.Session(profile_name='dev')
    s3_client = session.client('s3')

Save the DataFrame to a Parquet file

    df.to_parquet(parquet_pandas_file)

Upload the file to S3

    with open(parquet_pandas_file, 'rb') as s3_source_data:
        s3_client.upload_fileobj(s3_source_data, 'bucket_name', 'bucket_key_name')
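If you would rather skip the temporary file, the same idea can be sketched with an in-memory buffer. This is only a sketch: upload_parquet and client_for_profile are hypothetical helper names, not part of boto3 or pandas, and the optional region_name argument is one way to pick a non-default region in code, as the question asks.

```python
import io


def client_for_profile(profile_name, region_name=None):
    """Create an S3 client bound to a named profile (and optionally a region)."""
    # Imported lazily so the upload helper below does not require boto3
    # until a real client is actually needed.
    import boto3

    session = boto3.Session(profile_name=profile_name, region_name=region_name)
    return session.client("s3")


def upload_parquet(df, s3_client, bucket, key):
    """Serialize df to Parquet in memory and upload it with the given client.

    Returns the number of bytes that were handed to the client.
    """
    buffer = io.BytesIO()
    df.to_parquet(buffer)               # write the Parquet bytes into the buffer
    size = buffer.getbuffer().nbytes    # measure before the upload consumes it
    buffer.seek(0)                      # rewind so the upload starts at byte 0
    s3_client.upload_fileobj(buffer, bucket, key)
    return size
```

With the 'dev' profile from above, a call would look like upload_parquet(df, client_for_profile('dev'), 'bucket_name', 'bucket_key_name').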

Answer 2

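When using s3fs, set the profile name with the following code. Depending on the s3fs version, the keyword may be profile rather than profile_name.

    import s3fs

    fs = s3fs.S3FileSystem(profile_name='<profile name>')
    with fs.open('s3://bucketname/root1/file.csv', 'w') as f:
        df.to_csv(f)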
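To keep the one-line df.to_parquet(s3_url) call from the question, it may also be possible to pass the profile through pandas itself. The sketch below assumes pandas 1.2 or later (where to_parquet accepts storage_options) and a recent s3fs release that takes the session keyword profile. Both helper names are hypothetical.

```python
def storage_options_for_profile(profile):
    """Build the storage_options dict that pandas forwards to s3fs via fsspec."""
    # Assumption: a recent s3fs release, where the session keyword is 'profile'.
    return {"profile": profile}


def to_parquet_with_profile(df, s3_url, profile):
    """Write df to s3_url using the named AWS profile instead of the default."""
    # Assumption: pandas 1.2+, where DataFrame.to_parquet accepts storage_options.
    df.to_parquet(s3_url, storage_options=storage_options_for_profile(profile))
```

For example: to_parquet_with_profile(df, 's3://my_bucket/path/to/file.parquet', 'dev'). This touches neither environment variables nor the default entry in the credentials file.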

Specify the AWS profile name to use when uploading a Pandas DataFrame to S3
