I have a very simple Glue script that reads a number of Parquet files from an S3 bucket and tries to partition them so that AWS Athena can query them faster.
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.{col, dayofmonth, month, year}
import scala.collection.JavaConverters._
object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // Read the unpartitioned Parquet files from S3.
    val dyf = glueContext.getSourceWithFormat(connectionType = "s3",
      options = JsonOptions(Map("path" -> "s3://parquet-not-partitioned/")), format = "parquet").getDynamicFrame()
    // Derive year/month/day partition columns from the Test_time timestamp.
    var newDF = dyf.toDF()
    newDF = newDF.withColumn("year", year(col("Test_time")))
      .withColumn("month", month(col("Test_time")))
      .withColumn("day", dayofmonth(col("Test_time")))
    // Shuffle so that rows belonging to the same output partition land in the same task.
    val partitioned_DF = newDF.repartition(col("year"), col("month"), col("day"), col("Test"))
    val timestamped = DynamicFrame(partitioned_DF, glueContext)
    // Write partitioned, Snappy-compressed Parquet back to S3.
    val datasink4 = glueContext.getSinkWithFormat(connectionType = "s3",
      options = JsonOptions(Map("path" -> "s3://parquet_partiotioned/",
        "partitionKeys" -> Seq("year", "month", "day", "Device_Under_Test"))),
      transformationContext = "datasink4", format = "glueparquet",
      formatOptions = JsonOptions(Map("compression" -> "snappy", "blockSize" -> 134217728, "pageSize" -> 1048576))).writeDynamicFrame(timestamped)
    Job.commit()
  }
}
Please note: I repartition on a few of the columns and then partition on them again when writing to S3, so that fewer output files are generated.

If I skip the repartition step, the job runs, but it produces around 70K files of roughly 50 KB each, which makes Athena queries very slow.
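To be clear about the intent, this is roughly the pattern I am going for, expressed in plain Spark DataFrame terms (just a sketch for illustration; the actual job uses the Glue sink shown above, not DataFrameWriter, and also includes the Test / Device_Under_Test column):

// Rough sketch of the intent in plain Spark, reusing newDF and col from the script above:
// the shuffle groups rows by the partition columns, so each Hive-style output
// partition is written by a single task and ends up with a few larger files.
newDF
  .repartition(col("year"), col("month"), col("day"))
  .write
  .partitionBy("year", "month", "day")
  .mode("overwrite")
  .parquet("s3://parquet_partiotioned/")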
With the repartition step in place, I get the following error:
org.apache.spark.SparkException Job aborted due to stage failure:
Task 119 in stage 2.0 failed 4 times, most recent failure:
Lost task 119.3 in stage 2.0 (TID 2205, xx.xx-1.compute.internal, executor 89): ExecutorLostFailure (executor 89 exited caused by one of the running tasks) Reason:
Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
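I assume the suggestion from the log, raising spark.yarn.executor.memoryOverhead, would have to be applied where the SparkContext is created, along the lines below; I am not sure whether Glue actually honours configuration set this way, and the value is only a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical: raise the YARN executor memory overhead (value in MB is a placeholder).
// Whether AWS Glue respects a SparkConf passed this way is an assumption on my part.
val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")
val spark: SparkContext = new SparkContext(conf)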
Can anyone share some ideas on how to partition and write this data without generating so many small files?