AWS Glue error when converting a DynamicFrame to a Spark DataFrame

Time: 2020-05-13 15:33:04

Tags: amazon-web-services apache-spark aws-glue

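I am using an AWS Glue crawler to read some data from S3 into a table.
Next, I want to apply some transformations with an AWS Glue job. I can modify and run the script on a small file, but when I try to run it on a larger dataset I get the error below, which seems to be complaining about converting the DynamicFrame to a Spark DataFrame. I don't even know how to start debugging it.
I haven't found many related posts here; the ones I did find only cover the Spark DataFrame -> DynamicFrame direction.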


18:49:31
er$$anonfun$init$1.apply(GrokReader.scala:62)
    at scala.collection.Iterator$$anon$9.next(Iterator.scala:162)
    at scala.collection.Iterator$$anon$16.hasNext(Iterator.scala:599)
    at com.amazonaws.services.glue.readers.GrokReader.hasNext(GrokReader.scala:117)
    at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.nextKeyValue(TapeHadoopRecordReader.scala:73)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:230)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
    at scala.collection.AbstractIterator.aggregate(Iterator.scala:1334)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1145)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1145)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1146)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1146)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
2020-05-12 18:49:22,434 INFO  [Thread-9] scheduler.DAGScheduler (Logging.scala:logInfo(54)) - Job 0 failed: fromRDD at DynamicFrame.scala:241, took 8327.404883 s
2020-05-12 18:49:22,450 WARN  [task-result-getter-1] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9983.0 in stage 0.0 (TID 9986, ip-172-32-50-149.us-west-2.compute.internal, executor 2): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,451 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 10000.0 in stage 0.0 (TID 10003, ip-172-32-50-149.us-west-2.compute.internal, executor 2): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,451 WARN  [task-result-getter-3] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9986.0 in stage 0.0 (TID 9989, ip-172-32-50-149.us-west-2.compute.internal, executor 2): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,451 WARN  [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9985.0 in stage 0.0 (TID 9988, ip-172-32-50-149.us-west-2.compute.internal, executor 2): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,451 WARN  [task-result-getter-1] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9864.0 in stage 0.0 (TID 9864, ip-172-32-62-222.us-west-2.compute.internal, executor 5): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,454 INFO  [dispatcher-event-loop-3] storage.BlockManagerInfo (Logging.scala:logInfo(54)) - Added broadcast_25_piece0 in memory on ip-172-32-56-53.us-west-2.compute.internal:34837 (size: 32.1 KB, free: 2.8 GB)
2020-05-12 18:49:22,455 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9900.0 in stage 0.0 (TID 9900, ip-172-32-62-222.us-west-2.compute.internal, executor 5): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,456 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9991.0 in stage 0.0 (TID 9994, ip-172-32-56-53.us-west-2.compute.internal, executor 4): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,456 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9949.0 in stage 0.0 (TID 9949, ip-172-32-56-53.us-west-2.compute.internal, executor 4): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,456 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9975.0 in stage 0.0 (TID 9977, ip-172-32-62-222.us-west-2.compute.internal, executor 5): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,456 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9995.0 in stage 0.0 (TID 9998, ip-172-32-62-222.us-west-2.compute.internal, executor 7): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,456 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 10001.0 in stage 0.0 (TID 10004, ip-172-32-62-222.us-west-2.compute.internal, executor 5): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,457 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9993.0 in stage 0.0 (TID 9996, ip-172-32-62-222.us-west-2.compute.internal, executor 7): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,457 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9939.0 in stage 0.0 (TID 9939, ip-172-32-62-222.us-west-2.compute.internal, executor 7): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,457 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9930.0 in stage 0.0 (TID 9930, ip-172-32-62-222.us-west-2.compute.internal, executor 7): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,457 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9998.0 in stage 0.0 (TID 10001, ip-172-32-54-163.us-west-2.compute.internal, executor 6): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,462 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9965.0 in stage 0.0 (TID 9967, ip-172-32-56-53.us-west-2.compute.internal, executor 1): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,463 WARN  [task-result-getter-3] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9934.0 in stage 0.0 (TID 9934, ip-172-32-56-53.us-west-2.compute.internal, executor 1): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,463 WARN  [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9990.0 in stage 0.0 (TID 9993, ip-172-32-56-53.us-west-2.compute.internal, executor 1): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,464 WARN  [task-result-getter-1] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9992.0 in stage 0.0 (TID 9995, ip-172-32-56-53.us-west-2.compute.internal, executor 4): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,464 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9967.0 in stage 0.0 (TID 9969, ip-172-32-56-53.us-west-2.compute.internal, executor 4): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,464 WARN  [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9982.0 in stage 0.0 (TID 9985, ip-172-32-54-163.us-west-2.compute.internal, executor 3): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,464 WARN  [task-result-getter-3] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9999.0 in stage 0.0 (TID 10002, ip-172-32-54-163.us-west-2.compute.internal, executor 3): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,464 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9984.0 in stage 0.0 (TID 9987, ip-172-32-54-163.us-west-2.compute.internal, executor 3): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,464 WARN  [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9966.0 in stage 0.0 (TID 9968, ip-172-32-54-163.us-west-2.compute.internal, executor 3): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,467 WARN  [task-result-getter-3] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9996.0 in stage 0.0 (TID 9999, ip-172-32-54-163.us-west-2.compute.internal, executor 6): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,474 WARN  [task-result-getter-1] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9960.0 in stage 0.0 (TID 9962, ip-172-32-54-163.us-west-2.compute.internal, executor 6): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,474 WARN  [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 9980.0 in stage 0.0 (TID 9982, ip-172-32-54-163.us-west-2.compute.internal, executor 6): TaskKilled (Stage cancelled)
2020-05-12 18:49:22,514 INFO  [dispatcher-event-loop-0] yarn.YarnAllocator (Logging.scala:logInfo(54)) - Driver requested a total number of 1 executor(s).
Traceback (most recent call last):
  File "script_2020-05-12-16-29-01.py", line 30, in <module>
    dns = datasource0.toDF()
  File "/mnt/yarn/usercache/root/appcache/application_1589300850182_0001/container_1589300850182_0001_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 147, in toDF
    return DataFrame(self._jdf.toDF(self.glue_ctx._jvm.PythonUtils.toSeq(scala_options)), self.glue_ctx)
  File "/mnt/yarn/usercache/root/appcache/application_1589300850182_0001/container_1589300850182_0001_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/mnt/yarn/usercache/root/appcache/application_1589300850182_0001/container_1589300850182_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/mnt/yarn/usercache/root/appcache/application_1589300850182_0001/container_1589300850182_0001_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o66.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9929 in stage 0.0 failed 4 times, most recent failure: Lost task 9929.3 in stage 0.0 (TID 9983, ip-172-32-56-53.us-west-2.compute.internal, executor 1): java.io.IOException: too many length or distance symbols
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:225)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at com.amazonaws.services.glue.readers.BufferedStream.read(DynamicRecordReader.scala:91)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.amazonaws.services.glue.readers.GrokReader$$anonfun$init$1$$anonfun$apply$1.apply$mcV$sp(GrokReader.scala:68)
    at scala.util.control.Breaks.breakable(Breaks.scala:38)
    at com.amazonaws.services.glue.readers.GrokReader$$anonfun$init$1.apply(GrokReader.scala:66)
    at com.amazonaws.services.glue.readers.GrokReader$$anonfun$init$1.apply(GrokReader.scala:62)
    at scala.collection.Iterator$$anon$9.next(Iterator.scala:162)
    at scala.collection.Iterator$$anon$16.hasNext(Iterator.scala:599)
    at com.amazonaws.services.glue.readers.GrokReader.hasNext(GrokReader.scala:117)
    at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.nextKeyValue(TapeHadoopRecordReader.scala:73)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:230)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
    at scala.collection.AbstractIterator.aggregate(Iterator.scala:1334)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1145)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1145)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1146)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1146)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
    at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1161)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1137)
    at org.apache.spark.sql.glue.util.SchemaUtils$.fromRDD(SchemaUtils.scala:72)
    at com.amazonaws.services.glue.DynamicFrame.recomputeSchema(DynamicFrame.scala:241)
    at com.amazonaws.services.glue.DynamicFrame.schema(DynamicFrame.scala:227)
    at com.amazonaws.services.glue.DynamicFrame.toDF(DynamicFrame.scala:290)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: too many length or distance symbols
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:225)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at com.amazonaws.services.glue.readers.BufferedStream.read(DynamicRecordReader.scala:91)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.amazonaws.services.glue.readers.GrokReader$$anonfun$init$1$$anonfun$apply$1.apply$mcV$sp(GrokReader.scala:68)
    at scala.util.control.Breaks.breakable(Breaks.scala:38)
    at com.amazonaws.services.glue.readers.GrokReader$$anonfun$init$1.apply(GrokReader.scala:66)
    at com.amazonaws.services.glue.readers.GrokReader$$anonfun$init$1.apply(GrokReader.scala:62)
    at scala.collection.Iterator$$anon$9.next(Iterator.scala:162)
    at scala.collection.Iterator$$anon$16.hasNext(Iterator.scala:599)
    at com.amazonaws.services.glue.readers.GrokReader.hasNext(GrokReader.scala:117)
    at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.nextKeyValue(TapeHadoopRecordReader.scala:73)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:230)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
    at scala.collection.AbstractIterator.aggregate(Iterator.scala:1334)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1145)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1145)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1146)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1146)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

2020-05-12 18:49:22,587 ERROR [Driver] yarn.ApplicationMaster (Logging.scala:logError(70)) - User application exited with status 1
2020-05-12 18:49:22,588 INFO  [Driver] yarn.ApplicationMaster (Logging.scala:logInfo(54)) - Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
2020-05-12 18:49:22,591 INFO  [pool-4-thread-1] spark.SparkContext (Logging.scala:logInfo(54)) - Invoking stop() from shutdown hook
2020-05-12 18:49:22,594 INFO  [pool-4-thread-1] server.AbstractConnector (AbstractConnector.java:doStop(318)) - Stopped Spark@3a4d5cae{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
2020-05-12 18:49:22,595 INFO  [pool-4-thread-1] ui.SparkUI (Logging.scala:logInfo(54)) - Stopped Spark web UI at http://ip-172-32-50-149.us-west-2.compute.internal:40355
2020-05-12 18:49:22,597 INFO  [dispatcher-event-loop-2] yarn.YarnAllocator (Logging.scala:logInfo(54)) - Driver requested a total number of 0 executor(s).
2020-05-12 18:49:22,598 INFO  [pool-4-thread-1] cluster.YarnClusterSchedulerBackend (Logging.scala:logInfo(54)) - Shutting down all executors
2020-05-12 18:49:22,598 INFO  [dispatcher-event-loop-3] cluster.YarnSchedulerBackend$YarnDriverEndpoint (Logging.scala:logInfo(54)) - Asking each executor to shut down
2020-05-12 18:49:22,600 INFO  [pool-4-thread-1] cluster.SchedulerExtensionServices (Logging.scala:logInfo(54)) - Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
2020-05-12 18:49:22,604 INFO  [dispatcher-event-loop-3] spark.MapOutputTrackerMasterEndpoint (Logging.scala:logInfo(54)) - MapOutputTrackerMasterEndpoint stopped!
2020-05-12 18:49:22,616 INFO  [pool-4-thread-1] memory.MemoryStore (Logging.scala:logInfo(54)) - MemoryStore cleared
2020-05-12 18:49:22,616 INFO  [pool-4-thread-1] storage.BlockManager (Logging.scala:logInfo(54)) - BlockManager stopped
2020-05-12 18:49:22,617 INFO  [pool-4-thread-1] storage.BlockManagerMaster (Logging.scala:logInfo(54)) - BlockManagerMaster stopped
2020-05-12 18:49:22,618 INFO  [dispatcher-event-loop-2] scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint (Logging.scala:logInfo(54)) - OutputCommitCoordinator stopped!
2020-05-12 18:49:22,621 INFO  [pool-4-thread-1] spark.SparkContext (Logging.scala:logInfo(54)) - Successfully stopped SparkContext
2020-05-12 18:49:22,623 INFO  [pool-4-thread-1] yarn.ApplicationMaster (Logging.scala:logInfo(54)) - Unregistering ApplicationMaster with FAILED (diag message: User application exited with status 1)
2020-05-12 18:49:22,631 INFO  [pool-4-thread-1] impl.AMRMClientImpl (AMRMClientImpl.java:unregisterApplicationMaster(476)) - Waiting for application to be successfully unregistered.
2020-05-12 18:49:22,733 INFO  [pool-4-thread-1] yarn.ApplicationMaster
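
A possible starting point for debugging, judging only from the trace above: the root cause is `java.io.IOException: too many length or distance symbols`, raised inside `ZlibDecompressor` while `toDF()` recomputes the schema by scanning the input. That message comes from zlib's inflate routine, so it may mean that at least one compressed input file is corrupt or truncated, rather than that the conversion itself is at fault. This is an assumption; nothing in the post confirms it. The sketch below assumes the inputs are gzip files that have been copied locally; `check_gzip` is a hypothetical helper name, not part of Glue or Spark.

```python
import gzip
import zlib


def check_gzip(path, chunk_size=1024 * 1024):
    """Return None if the gzip file decompresses fully, else an error string.

    Hypothetical helper for illustration: it only tests whether the whole
    stream can be inflated, not whether the records inside are well formed.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Read in chunks so a large file is never held in memory at once.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers bad gzip headers, EOFError covers truncated files,
        # zlib.error covers corrupt deflate data inside the stream.
        return f"{type(exc).__name__}: {exc}"
    return None
```

Running `check_gzip` over each downloaded object and printing the paths that return a non-`None` value would narrow down which files, if any, cannot be decompressed.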

0 answers:

No answers