Multipart upload to S3 fails when closing a SequenceFile writer from Spark on EMR

Date: 2017-01-11 04:03:30

Tags: apache-spark amazon-s3 emr

I am getting the error "Upload attempts for part num: 2 have already reached max limit of: 5, will throw exception and fail" when trying to close a sequence file writer. The full log of the exception is below:

16/12/30 19:47:01 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt/s3/57b63810-c20a-438c-a73f-48d50e0be7d2-0001 94317523 bytes md5: 05ww/fe3pNni9Zvfm+l4Gg== md5hex: d39c30fdf7b7a4d9e2f59bdf9be9781a
16/12/30 19:47:12 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt1/s3/57b63810-c20a-438c-a73f-48d50e0be7d2-0002 94317523 bytes md5: 05ww/fe3pNni9Zvfm+l4Gg== md5hex: d39c30fdf7b7a4d9e2f59bdf9be9781a
16/12/30 19:47:23 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt/s3/57b63810-c20a-438c-a73f-48d50e0be7d2-0003 94317523 bytes md5: 05ww/fe3pNni9Zvfm+l4Gg== md5hex: d39c30fdf7b7a4d9e2f59bdf9be9781a
16/12/30 19:47:35 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt1/s3/57b63810-c20a-438c-a73f-48d50e0be7d2-0004 94317523 bytes md5: 05ww/fe3pNni9Zvfm+l4Gg== md5hex: d39c30fdf7b7a4d9e2f59bdf9be9781a
16/12/30 19:47:46 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt/s3/57b63810-c20a-438c-a73f-48d50e0be7d2-0005 94317523 bytes md5: 05ww/fe3pNni9Zvfm+l4Gg== md5hex: d39c30fdf7b7a4d9e2f59bdf9be9781a
16/12/30 19:47:57 ERROR s3n.MultipartUploadOutputStream: Upload attempts for part num: 2 have already reached max limit of: 5, will throw exception and fail
16/12/30 19:47:57 INFO s3n.MultipartUploadOutputStream: completeMultipartUpload error for key: output/part-20176
java.lang.IllegalStateException: Reached max limit of upload attempts for part
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.spawnNewFutureIfNeeded(MultipartUploadOutputStream.java:362)
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadMultiParts(MultipartUploadOutputStream.java:422)
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:471)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
    at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1290)
   ...
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:727)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:727)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/12/30 19:47:59 INFO s3n.MultipartUploadOutputStream: uploadPart error com.amazonaws.AbortedException: 
16/12/30 19:48:18 INFO s3n.MultipartUploadOutputStream: uploadPart error com.amazonaws.AbortedException: 

I only get the error after all 5 retries have failed, and I don't understand the cause. Has anyone seen this error before? What could be causing it?

I am writing the sequence files using my own multi-output writer implementation:

import org.apache.hadoop.conf
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.SequenceFile
import org.apache.hadoop.io.SequenceFile.{Metadata, Writer}

class MultiOutputSequenceFileWriter(prefix: String, suffix: String) extends Serializable {
  // One writer per output folder (pathKey), created lazily on first write.
  private val writers = collection.mutable.Map[String, SequenceFile.Writer]()

  /**
    * @param pathKey    folder within prefix where the content will be written
    * @param valueKey   key of the data to be written
    * @param valueValue value of the data to be written
    */
  def write(pathKey: String, valueKey: Any, valueValue: Any) = {
    if (!writers.contains(pathKey)) {
      val path = new Path(prefix + "/" + pathKey + "/" + "part-" + suffix)
      val hadoopConf = new conf.Configuration()
      hadoopConf.setEnum("io.seqfile.compression.type", SequenceFile.CompressionType.NONE)
      val fs = FileSystem.get(hadoopConf)
      writers(pathKey) = SequenceFile.createWriter(hadoopConf, Writer.file(path),
        Writer.keyClass(valueKey.getClass()),
        Writer.valueClass(valueValue.getClass()),
        Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size", 4096)), // 4KB
        Writer.replication(fs.getDefaultReplication()),
        Writer.blockSize(1073741824), // 1GB
        Writer.progressable(null),
        Writer.metadata(new Metadata()))
    }
    writers(pathKey).append(valueKey, valueValue)
  }

  def close = writers.values.foreach(_.close())
}

I am trying to write the sequence files as follows:

...
rdd.mapPartitionsWithIndex { (p, it) =>
  val writer = new MultiOutputSequenceFileWriter("s3://bucket/output/", p.toString)
  for ((key1, key2, data) <- it) {
    ...
    writer.write(key1, key2, data)
    ...
  }
  writer.close
  Nil.iterator
}.foreach((x: Nothing) => ()) // consume the empty result to force the partitions to run
...

Notes:

  • I get the exception when trying to close the writer (I believe the writer flushes pending content before it closes, and that is where the exception comes from).
  • I reran the same job twice with the same input. The first re-run had no errors, but the second one hit the error three times. Could this just be a transient issue in S3?
  • There are no failed part files in S3.

2 Answers:

Answer 0 (score: 4)

An AWS support engineer mentioned that there were a lot of hits on the bucket at the time of the error. The job was retrying the default number of times (5), and most likely all of those retries were being throttled. I have now increased the number of retries by adding the following configuration parameter when submitting the job.

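A minimal sketch of such a setting, assuming the EMRFS retry property fs.s3.maxRetries and an arbitrary value of 20 (the actual parameter name and value used are not shown in this post); Spark forwards spark.hadoop.* keys into the Hadoop configuration, and the same key can be passed with --conf at spark-submit time:

import org.apache.spark.{SparkConf, SparkContext}

// Assumed retry knob: fs.s3.maxRetries, raised above the limit of 5 reported
// in the error above. Equivalent to passing
// --conf spark.hadoop.fs.s3.maxRetries=20 on the spark-submit command line.
val sparkConf = new SparkConf()
  .setAppName("multi-output-sequence-files")
  .set("spark.hadoop.fs.s3.maxRetries", "20")
val sc = new SparkContext(sparkConf)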

In addition, I compressed the output so that fewer requests are made to S3. After these changes I have not seen a failure over several runs.
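As a sketch of what compressing the output could look like (the answer does not say which codec or compression type was used, so DefaultCodec and BLOCK compression are assumptions), the CompressionType.NONE setup in the writer above could be replaced inside the write method, reusing its hadoopConf, path, valueKey and valueValue:

import org.apache.hadoop.io.SequenceFile
import org.apache.hadoop.io.SequenceFile.Writer
import org.apache.hadoop.io.compress.DefaultCodec

// BLOCK compression batches many records into each compressed block, so far
// fewer bytes (and therefore fewer multipart parts/requests) are pushed to S3.
val codec = new DefaultCodec()
codec.setConf(hadoopConf)
writers(pathKey) = SequenceFile.createWriter(hadoopConf, Writer.file(path),
  Writer.keyClass(valueKey.getClass()),
  Writer.valueClass(valueValue.getClass()),
  Writer.compression(SequenceFile.CompressionType.BLOCK, codec))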

Answer 1 (score: 0)

The writer (Amazon's code, BTW, which neither the Spark nor the Hadoop teams will touch) writes the data out in blocks as it is generated (in background threads); the remaining data and the multipart upload itself are committed in close(), which is also where the code blocks waiting for all pending uploads to complete.

It sounds like some of the PUT requests failed, and it is in the close() call that the failure gets picked up and reported. I don't know whether the EMR s3:// client uses that block size as the size marker for its parts; I would personally recommend a smaller size, such as 128MB.
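As a sketch of that suggestion (the answer does not show code, so applying it to the createWriter call from the question is an assumption), the 1GB block size could simply be shrunk:

writers(pathKey) = SequenceFile.createWriter(hadoopConf, Writer.file(path),
  Writer.keyClass(valueKey.getClass()),
  Writer.valueClass(valueValue.getClass()),
  Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size", 4096)),
  Writer.replication(fs.getDefaultReplication()),
  Writer.blockSize(128L * 1024 * 1024), // 128MB instead of 1073741824 (1GB)
  Writer.progressable(null),
  Writer.metadata(new Metadata()))

Whether the EMR client's multipart part size actually tracks this value is, as noted above, not confirmed.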

In any case: assume a transient network problem, or that the EC2 VM you were allocated has poor network connectivity. Ask for a new VM.