My current problem is that writing a small file from a DynamicFrame to S3 takes a very long time (a 100,000-row CSV with 100 columns takes over an hour). I am writing both Parquet and CSV, so I suppose that is two write operations, but it still seems far too slow. Is there a problem with my code, or is PySpark usually this slow?
It should be noted that I am testing my script from a Zeppelin notebook + dev endpoint (5 DPUs) to get around the 10-minute cold start, but I hope that is not the reason it is so slow. I am using Spark 2.4 and Python 3.
%pyspark
import boto3
import sys
import time
import uuid
from datetime import datetime
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name
def some_mapping(rec):
    # does something trivial
    return rec
start = time.time()
print("Starting")
args = {
"key": "2000.csv",
"input_bucket": "my-input-bucket",
"output_bucket": "my-output-bucket",
}
output_path = args["output_bucket"]
connection_options = {"path": output_path}
s3 = boto3.resource("s3")
input_bucket = s3.Bucket(args["input_bucket"])
db = boto3.resource("dynamodb", region_name="us-east-1")
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
DyF = glueContext.create_dynamic_frame_from_options(
connection_type="s3",
connection_options={"paths": ["s3://{}/{}".format(args["input_bucket"], args["key"])]},
format="csv",
format_options={
"withHeader": True,
"separator": ","
}
)
mapped_DyF = DyF.map(some_mapping)
# Write to s3
end = time.time()
print("Time: ",end-start) #Transformation takes less than 30 seconds
mapped_DyF.write(connection_type="s3",
connection_options={"path": "{}/parquet".format(args["output_bucket"])},
format="parquet")
end2 = time.time() # Takes forever
print("Time: ",end2-end)
mapped_DyF.write(connection_type="s3",
connection_options={"path": "{}/csv".format(args["output_bucket"])},
format="csv")
end3 = time.time()
print("Time: ", end3-end2) # Also takes forever
print("Time: ", end3-start) # Total time is > 1 hour.
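For reference, the repeated `end - start` arithmetic above can be factored into a small labelled-timing helper so each phase reports itself consistently. This is a plain-Python sketch with no Glue/Spark dependency; the `label` names are illustrative:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Prints the wall-clock time spent inside the `with` block.
    start = time.time()
    yield
    print("{}: {:.1f}s".format(label, time.time() - start))

# Hypothetical usage around the writes above:
# with timed("parquet write"):
#     mapped_DyF.write(connection_type="s3", ...)
```

Note that Spark transformations such as `DyF.map(...)` are lazy, so a timer around the map call measures only plan construction; the actual work happens inside the write, which is why the write phases dominate the measured time.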