I think I've covered all the required information about my Glue usage; please let me know if you need anything more.
Here is my scenario:
aws s3 ls s3://bucketname/ --recursive --profile production | grep auto | wc -l
2487
Only these 2487 S3 objects are of interest for the transformation.
aws s3api list-objects --bucket bucketname --output json --query "[sum(Contents[].Size), length(Contents[])]" --profile production | awk 'NR!=2 {print $0; next} NR==2 {print $0/1024/1024/1024" GB"}'
[
344.768 GB
3829
]
Each S3 object is no larger than 100 MB, and each one is compressed JSON.
3829 is the total number of objects, but I only want to process 2487 of them.
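Ideally I would read only that subset. The only way I can think of is a push-down predicate on the catalog source, but that is just a sketch and would only apply if the table were partitioned (mine may not be); the `source` column and the `'auto'` value below are hypothetical:

import com.amazonaws.services.glue.{DynamicFrame, GlueContext}

// Hypothetical: assumes the catalog table is partitioned on a 'source'
// column whose 'auto' value selects the 2487 objects of interest.
val glueContext: GlueContext = new GlueContext(sc)
val filtered: DynamicFrame = glueContext
  .getCatalogSource(
    database = "jsondb",
    tableName = "01",
    pushDownPredicate = "source = 'auto'"
  )
  .getDynamicFrame()

I am not sure this fits my layout, since my data may not be partitioned that way.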
Scala code:

import com.amazonaws.services.glue.{DynamicFrame, GlueContext}

// sc is the SparkContext provided by the Glue job environment.
val glueContext: GlueContext = new GlueContext(sc)
val auto01: DynamicFrame = glueContext.getCatalogSource(database = "jsondb", tableName = "01").getDynamicFrame()
auto01.printSchema()
When I try to get the schema, I get:
18/06/09 18:31:44 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 32, ip-172-31-16-40.ec2.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/09 18:31:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
..
..
..
18/06/09 18:34:13 WARN ExecutorAllocationManager: Attempted to mark unknown executor 12 idle
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 44, ip-172-31-16-40.ec2.internal, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.0 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
at org.apache.spark.sql.glue.util.SchemaUtils$.fromRDD(SchemaUtils.scala:57)
at com.amazonaws.services.glue.DynamicFrame.recomputeSchema(DynamicFrame.scala:235)
at com.amazonaws.services.glue.DynamicFrame.schema(DynamicFrame.scala:223)
at com.amazonaws.services.glue.DynamicFrame.printSchema(DynamicFrame.scala:244)
... 48 elided
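The message itself suggests raising spark.yarn.executor.memoryOverhead. With plain spark-submit I would try something like the following (only a sketch; the jar name is a placeholder, and I am not sure Glue lets me pass these flags at all):

spark-submit \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my-glue-job.jar

But I don't know whether, or how, that applies inside a Glue job.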
Am I missing something about Glue here?