Question

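I am running queries with pyathena and have created a pandas dataframe. Is there a way to write the pandas dataframe directly to an AWS Athena database, like data.to_sql does for a MySQL database?

Here is a sample of the dataframe code for reference; it needs to be written to an AWS Athena database: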

data=pd.DataFrame({'id':[1,2,3,4,5,6],'name':['a','b','c','d','e','f'],'score':[11,22,33,44,55,66]})

Answer 1

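Another modern way to achieve this (as of February 2020) is to use the aws-data-wrangler library. It automates many routine (and sometimes annoying) tasks in data processing.

Applied to the case from the question, the code looks like this: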

import pandas as pd
import awswrangler as wr

data=pd.DataFrame({'id':[1,2,3,4,5,6],'name':['a','b','c','d','e','f'],'score':[11,22,33,44,55,66]})

# Typical Pandas, Numpy or Pyarrow transformation HERE!

wr.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=data,
    database="database",
    path="s3://your-s3-bucket/path/to/new/table",
    partition_cols=["name"],
)

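This is very useful because aws-data-wrangler knows how to parse the table name from the path (though you can also pass the table name as a parameter) and defines the appropriate types in the Glue catalog based on the dataframe.

It also helps with querying data through Athena directly into a pandas dataframe:

df = wr.pandas.read_table(database="database", table="table")

The whole process is fast and convenient.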

Answer 2

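AWS Athena's storage is S3, and it only reads data from files in S3. So you cannot write data directly to an Athena database the way you would with any other database.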

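Historically it did not support INSERT INTO ...; Athena has since added INSERT INTO, but the rows are still written as files in S3.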

Read more about Athena's limitations in the AWS documentation.

Here are the steps to make it work.

1. Write the pandas output to a file.
2. Save the file to the S3 location that AWS Athena reads from.
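
The two steps above can be sketched roughly as follows. This is only a minimal sketch, not the answerer's code: the helper name upload_dataframe, the bucket and the key are hypothetical, the client is assumed to be a boto3 S3 client (boto3.client("s3")), and the Athena table is assumed to be defined over headerless CSV files under the target prefix.

```python
def upload_dataframe(df, s3_client, bucket, key):
    """Serialize a pandas DataFrame to CSV and store it as one S3 object.

    df        -- a pandas DataFrame
    s3_client -- assumed to be a boto3 S3 client, e.g. boto3.client("s3")
    bucket    -- hypothetical bucket name
    key       -- hypothetical object key under the table's LOCATION prefix
    """
    # Step 1: write the pandas output to a "file" (kept in memory here).
    # The header row is left out on the assumption that the Athena table
    # reads plain CSV and would otherwise treat the header as a data row.
    body = df.to_csv(index=False, header=False).encode("utf-8")

    # Step 2: save it to the S3 location that Athena reads from.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return f"s3://{bucket}/{key}"


# Usage (requires pandas, boto3 and AWS credentials):
#   import boto3
#   upload_dataframe(data, boto3.client("s3"),
#                    "your-s3-bucket", "path/to/table/data.csv")
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before pointing it at a real bucket.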

I hope this gives you some pointers.

Answer 3

One option is to use:

pandas_df.to_parquet(file, engine="pyarrow")

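This first saves the dataframe as a temporary file in Parquet format. For this you need the pyarrow dependency installed. Once the file is saved locally, you can push it to S3 using the AWS SDK for Python.

You can now create a new table in Athena by running the following query: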

    CREATE EXTERNAL TABLE IF NOT EXISTS `your_new_table`
        (col1 type1, col2 type2)
    PARTITIONED BY (col_partitions_if_necessary)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    LOCATION 's3 location of the folder containing your parquet file'
    tblproperties ("parquet.compression"="snappy");
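
Writing the column list by hand gets tedious for wide dataframes, so here is a rough sketch of generating a similar statement from the dataframe's dtypes. Treat it as an illustration under assumptions: build_create_table and the PANDAS_TO_ATHENA mapping are hypothetical helpers (the mapping is not exhaustive), the table name and location are placeholders, and it uses the shorter STORED AS PARQUET clause instead of spelling out the SerDe.

```python
# Assumed mapping from pandas dtype names to Athena column types
# (illustrative only; extend it for the dtypes you actually use).
PANDAS_TO_ATHENA = {
    "int64": "bigint",
    "int32": "int",
    "float64": "double",
    "bool": "boolean",
    "object": "string",
    "datetime64[ns]": "timestamp",
}


def build_create_table(table, columns, location, partitions=None):
    """Build a CREATE EXTERNAL TABLE statement for Parquet data.

    columns / partitions -- dicts of column name -> pandas dtype name.
    Partition columns must not be repeated in `columns`.
    """
    def render(cols):
        return ", ".join(
            f"`{name}` {PANDAS_TO_ATHENA[dtype]}" for name, dtype in cols.items()
        )

    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}`",
        f"    ({render(columns)})",
    ]
    if partitions:
        lines.append(f"PARTITIONED BY ({render(partitions)})")
    lines.append("STORED AS PARQUET")
    lines.append(f"LOCATION '{location}'")
    lines.append('TBLPROPERTIES ("parquet.compression"="SNAPPY");')
    return "\n".join(lines)


# Column types of the dataframe from the question, written out by hand;
# with pandas they could come from {c: str(t) for c, t in data.dtypes.items()}.
ddl = build_create_table(
    "your_new_table",
    {"id": "int64", "name": "object", "score": "int64"},
    "s3://your-s3-bucket/path/to/new/table/",
)
print(ddl)
```

The resulting string can be run in the Athena console or through any Athena client.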

Another option is to use pyathena. Taking the example from their official documentation:

import pandas as pd
from urllib.parse import quote_plus
from sqlalchemy import create_engine

conn_str = "awsathena+rest://:@athena.{region_name}.amazonaws.com:443/"\
           "{schema_name}?s3_staging_dir={s3_staging_dir}&s3_dir={s3_dir}&compression=snappy"

engine = create_engine(conn_str.format(
    region_name="us-west-2",
    schema_name="YOUR_SCHEMA",
    s3_staging_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/"),
    s3_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/")))

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
df.to_sql("YOUR_TABLE", engine, schema="YOUR_SCHEMA", index=False, if_exists="replace", method="multi")

In this case, the sqlalchemy dependency is required.

Write a Pandas dataframe to an AWS Athena database
