Question

I'm trying to figure out the best way to write data to S3 using (Py)Spark.

Reading from the S3 bucket seems to work without problems, but writing to it is really slow.

I started the Spark shell like this (including the hadoop-aws package):

AWS_ACCESS_KEY_ID=<key_id> AWS_SECRET_ACCESS_KEY=<secret_key> pyspark --packages org.apache.hadoop:hadoop-aws:3.2.0

Here is the sample application:

# Load several csv files from S3 to a Dataframe (no problems here)
df = spark.read.csv(path='s3a://mybucket/data/*.csv', sep=',')
df.show()

# Some processing
result_df = do_some_processing(df)
result_df.cache()
result_df.show()

# Write to S3
result_df.write.partitionBy('my_column').csv(path='s3a://mybucket/output', sep=',')  # This is really slow

When I try to write to S3, I get the following warning:

20/10/28 15:34:02 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.

Is there any setting I need to change to write to S3 efficiently? Right now it is really slow: it takes about 10 minutes to write 100 small files to S3.

Answer 1

It turns out you have to specify the committer manually (otherwise the default committer is used, which is not optimized for S3):

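# The chain is wrapped in parentheses because a comment cannot follow a line-continuation backslash
(
    result_df.write
    .partitionBy('my_column')
    # Use the S3A "partitioned" committer instead of the default FileOutputCommitter
    .option('fs.s3a.committer.name', 'partitioned')
    # Replace existing data in the partitions being written
    .option('fs.s3a.committer.staging.conflict-mode', 'replace')
    # Buffer in memory instead of on disk: potentially faster, but more memory intensive
    .option('fs.s3a.fast.upload.buffer', 'bytebuffer')
    .mode('overwrite')
    .csv(path='s3a://mybucket/output', sep=',')
)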
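
As a rough sketch, the same committer choice could also be made once at session level rather than on each write. The property names below are taken from the Hadoop S3A committer and Spark cloud-integration documentation. The helper names s3a_committer_conf and build_session are hypothetical. The two org.apache.spark.internal.io.cloud classes assume the spark-hadoop-cloud module is on the classpath, which the setup in the question does not mention, so treat this as a starting point and not a verified configuration.

```python
# Committer names and conflict modes accepted by the S3A committers.
VALID_COMMITTERS = {"file", "directory", "partitioned", "magic"}
VALID_CONFLICT_MODES = {"fail", "append", "replace"}


def s3a_committer_conf(name="partitioned", conflict_mode="replace"):
    """Return Spark properties that select an S3A committer (hypothetical helper)."""
    if name not in VALID_COMMITTERS:
        raise ValueError(f"unknown committer: {name!r}")
    if conflict_mode not in VALID_CONFLICT_MODES:
        raise ValueError(f"unknown conflict mode: {conflict_mode!r}")
    return {
        # Which S3A committer to use in place of the FileOutputCommitter.
        "spark.hadoop.fs.s3a.committer.name": name,
        # Only consulted by the staging committers ("directory", "partitioned").
        "spark.hadoop.fs.s3a.committer.staging.conflict-mode": conflict_mode,
        # Assumption: these classes ship with the spark-hadoop-cloud module.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }


def build_session(app_name="s3a-committer-demo"):
    """Create a SparkSession with the committer properties applied."""
    # Imported lazily so the configuration helper works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_committer_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session built this way, the write itself would no longer need the per-write committer options. Whether the warning about the standard FileOutputCommitter disappears is worth checking in the logs.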

Spark: how to efficiently write a DataFrame to S3
