Hadoop S3 driver returns 403 after several successful requests

Posted: 2019-07-16 16:32:39

Tags: amazon-web-services hadoop amazon-s3 nutch

I am using the AWS S3 driver with Apache Nutch to upload files from an EC2 instance to an S3 bucket. The EC2 instance has an IAM policy attached that allows access to the bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::storage"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::storage/*"
      ]
    }
  ]
}
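To rule out a credentials problem, the instance role and bucket access can be checked from the instance itself with standard AWS CLI commands (the bucket name `storage` is taken from the policy above):

```shell
# Show which IAM identity (instance role) the machine is actually using
aws sts get-caller-identity

# Confirm that role can list the bucket named in the policy
aws s3 ls s3://storage
```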

At first everything works: Nutch parses segments and writes them to the S3 bucket, but after a few segments it fails with the following error:

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: ...
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507)
        at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143)
        at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[ERROR] org.apache.nutch.crawl.CrawlDb: CrawlDb update job did not succeed, job status:FAILED, reason: NA
Exception in thread "main" java.lang.RuntimeException: CrawlDb update job did not succeed, job status:FAILED, reason: NA
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:142)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:83)

I believe the IAM policy is fine, because Nutch manages to upload several segments before failing.

My AWS-related Hadoop configuration is:

com.amazonaws.services.s3.enableV4=true
fs.s3a.endpoint=s3.us-east-2.amazonaws.com
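For reference, these two settings normally live in different places: `fs.s3a.endpoint` is a Hadoop property set in `core-site.xml`, while `com.amazonaws.services.s3.enableV4` is a JVM system property (passed via `-D`, e.g. through `HADOOP_OPTS`). A sketch, assuming a standard Hadoop 2.7.x layout:

```xml
<!-- core-site.xml: point the s3a connector at the us-east-2 regional endpoint -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.us-east-2.amazonaws.com</value>
</property>
```

with the signing flag passed to the JVM, for example `export HADOOP_OPTS="-Dcom.amazonaws.services.s3.enableV4=true"`.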

Why does this error occur, and how can I fix it?


Update: I am running Nutch programmatically (not from the CLI) on a single EC2 machine, not on a Hadoop cluster. To access S3 I am using the s3a file system (the output path is s3a://mybucket/data). The Hadoop version is 2.7.3 and the Nutch version is 1.15.

1 answer:

Answer 0 (score: 1)

The error above occurs when running in local mode, as a side effect of S3's inconsistency.

Since S3 only provides eventual consistency for read-after-write, there is no guarantee that a file will still be visible when you list it or try to rename it, even if it was previously written to the S3 bucket.

The Hadoop team also provides a troubleshooting guide: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

If your use case requires running in local mode, I suggest the following workaround:

  1. Write the files to a local folder.
  2. Sync the folder with `aws s3 sync local-folder s3://bucket-name --region region-name --delete`.
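A minimal sketch of that workaround (the local path, bucket name, and region are placeholders, not values from the question):

```shell
#!/bin/sh
# 1. Configure Nutch to write its output to a local folder instead of s3a://
LOCAL_DIR=/mnt/nutch-output      # placeholder local path
BUCKET=s3://bucket-name          # placeholder bucket

# 2. Mirror the local folder to S3; --delete removes remote files
#    that no longer exist locally, keeping the bucket consistent.
aws s3 sync "$LOCAL_DIR" "$BUCKET" --region region-name --delete
```

This sidesteps the rename/list operations that trip over S3's eventual consistency, since the Hadoop job only ever touches the local file system.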