Is AWS Glue scalable?

Date: 2018-06-09 18:59:34

Tags: scala amazon-web-services amazon-s3 apache-spark-sql aws-glue

I have included all the required information about my Glue setup; please let me know if you need anything more.

Here is my scenario:

aws s3 ls s3://bucketname/ --recursive --profile production | grep auto | wc -l

2487

For the transformation, no more than 2487 S3 objects are of interest.

aws s3api list-objects --bucket bucketname --output json --query "[sum(Contents[].Size), length(Contents[])]" --profile production | awk 'NR!=2 {print $0; next} NR==2 {print $0/1024/1024/1024" GB"}'

[
344.768 GB
    3829
]

Each S3 object is no larger than 100 MB, and the objects are compressed JSON.

3829 is the total number of objects, but only 2487 of them are of interest for processing.

Scala code:

import com.amazonaws.services.glue.{DynamicFrame, GlueContext}

// Load catalog table "01" from database "jsondb" as a DynamicFrame.
val glueContext: GlueContext = new GlueContext(sc)
val auto01: DynamicFrame = glueContext.getCatalogSource(database = "jsondb", tableName = "01").getDynamicFrame()
auto01.printSchema()
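
Since only 2487 of the 3829 objects are relevant, a push-down predicate on the same call might keep the unneeded objects out of the read entirely. This is a sketch only: it assumes the table were partitioned on a column that isolates those objects (the partition column "category" and its value below are hypothetical), and that the Glue version in use supports the pushDownPredicate parameter on getCatalogSource.

// Hypothetical: "category" stands in for whatever partition column
// would separate the 2487 objects of interest from the rest.
val auto01Subset: DynamicFrame = glueContext
  .getCatalogSource(
    database = "jsondb",
    tableName = "01",
    pushDownPredicate = "category = 'auto'"
  )
  .getDynamicFrame()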

Fetching the schema with the original printSchema call fails:

18/06/09 18:31:44 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 32, ip-172-31-16-40.ec2.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/09 18:31:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
..
..
..

18/06/09 18:34:13 WARN ExecutorAllocationManager: Attempted to mark unknown executor 12 idle
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 44, ip-172-31-16-40.ec2.internal, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.0 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
  at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
  at org.apache.spark.sql.glue.util.SchemaUtils$.fromRDD(SchemaUtils.scala:57)
  at com.amazonaws.services.glue.DynamicFrame.recomputeSchema(DynamicFrame.scala:235)
  at com.amazonaws.services.glue.DynamicFrame.schema(DynamicFrame.scala:223)
  at com.amazonaws.services.glue.DynamicFrame.printSchema(DynamicFrame.scala:244)
  ... 48 elided
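
The log's advice to boost spark.yarn.executor.memoryOverhead is not a setting Glue exposes directly, since Glue manages the cluster. One documented mitigation for jobs that read many S3 files is Glue's file grouping, which coalesces input files into larger read tasks. A minimal sketch, assuming getCatalogSource accepts additionalOptions in this Glue version; the "groupFiles"/"groupSize" option names come from the Glue documentation, and the 128 MB group size is an arbitrary choice:

import com.amazonaws.services.glue.util.JsonOptions

// Sketch: group many S3 input files into ~128 MB read tasks to lower
// per-executor memory pressure ("groupSize" is in bytes).
val grouped: DynamicFrame = glueContext
  .getCatalogSource(
    database = "jsondb",
    tableName = "01",
    additionalOptions = JsonOptions("""{"groupFiles": "inPartition", "groupSize": "134217728"}""")
  )
  .getDynamicFrame()
grouped.printSchema()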

Am I missing something with Glue here?

0 Answers:

There are no answers yet.