AWS Glue - Creating partitioned data in an S3 bucket

Time: 2020-03-02 15:49:07

Tags: amazon-web-services apache-spark amazon-s3 amazon-emr aws-glue

I have a very simple Glue script that reads multiple Parquet files from an S3 bucket and tries to partition them so that they can be queried faster with AWS Athena.

import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.{col, dayofmonth, month, year}
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // Read the unpartitioned Parquet files from S3.
    val dyf = glueContext.getSourceWithFormat(connectionType = "s3", options = JsonOptions(Map("path" -> "s3://parquet-not-partitioned/")), format = "parquet").getDynamicFrame()
    // Derive year/month/day columns from the Test_time timestamp.
    val newDF = dyf.toDF()
      .withColumn("year", year(col("Test_time")))
      .withColumn("month", month(col("Test_time")))
      .withColumn("day", dayofmonth(col("Test_time")))
    // Shuffle so that rows with the same key values land in the same Spark partition.
    val partitioned_DF = newDF.repartition(col("year"), col("month"), col("day"), col("Test"))
    val timestamped = DynamicFrame(partitioned_DF, glueContext)
    // Write snappy-compressed Parquet to S3, partitioned by the listed keys.
    val datasink4 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(Map("path" -> "s3://parquet_partiotioned/", "partitionKeys" -> Seq("year", "month", "day", "Device_Under_Test"))), transformationContext = "datasink4", format = "glueparquet", formatOptions = JsonOptions(Map("compression" -> "snappy", "blockSize" -> 134217728, "pageSize" -> 1048576))).writeDynamicFrame(timestamped)
    Job.commit()
  }
}

Note: I repartition the data on a few columns first, and then partition it again when writing to S3, so that fewer files are generated.

If I skip the repartition step, the job runs, but it produces about 70K files of roughly 50 KB each, which makes Athena queries very slow.
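
To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes that "50k" means about 50 KB per file and takes the 128 MB blockSize from the script above as the target file size; the object and method names are hypothetical and used only for illustration.

```scala
object FileCountEstimate {
  // Total bytes across all files.
  def totalBytes(fileCount: Long, bytesPerFile: Long): Long =
    fileCount * bytesPerFile

  // Number of files needed at the target size, rounded up.
  def targetFileCount(fileCount: Long, bytesPerFile: Long, targetBytes: Long): Long = {
    val total = totalBytes(fileCount, bytesPerFile)
    (total + targetBytes - 1) / targetBytes
  }

  def main(args: Array[String]): Unit = {
    val files = 70000L        // file count reported in the question
    val perFile = 50L * 1024  // assumption: "50k" means 50 KB
    val target = 134217728L   // blockSize from the script (128 MB)

    val gib = totalBytes(files, perFile) / math.pow(1024, 3)
    println(f"total size: about $gib%.2f GiB")
    println(s"files at target size: ${targetFileCount(files, perFile, target)}")
  }
}
```

Under those assumptions the whole data set is only about 3.3 GiB, which would fit in roughly 27 files of 128 MB. A partitioned layout still needs at least one file per distinct year/month/day/device combination, so the real lower bound depends on how many such combinations exist.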

With the repartition step, I get the following error:

org.apache.spark.SparkException Job aborted due to stage failure:
Task 119 in stage 2.0 failed 4 times, most recent failure:
Lost task 119.3 in stage 2.0 (TID 2205, xx.xx-1.compute.internal, executor 89): ExecutorLostFailure (executor 89 exited caused by one of the running tasks) Reason:
Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
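
The last line of the log names a Spark setting, spark.yarn.executor.memoryOverhead. As a minimal sketch of how that key could be set on a SparkConf before the SparkContext is created: the value 1024 (MiB) is purely illustrative, the object name is hypothetical, and whether a Glue job honors a setting applied this way is an assumption that the post does not confirm.

```scala
import org.apache.spark.SparkConf

object MemoryOverheadSketch {
  def main(args: Array[String]): Unit = {
    // Key taken from the error message; newer Spark versions call it
    // spark.executor.memoryOverhead. The value is an illustrative assumption.
    val conf = new SparkConf(false)
      .set("spark.yarn.executor.memoryOverhead", "1024")

    // In the script above this conf would be passed as: new SparkContext(conf)
    println(conf.get("spark.yarn.executor.memoryOverhead"))
  }
}
```

Raising the overhead only gives each executor more headroom. It does not change how the rows are distributed across tasks by the repartition step.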

Can anyone share some innovative ideas for partitioning and writing the data without generating lots of small files?

0 answers:

No answers