403 exception when trying to build a DataFrame from a Parquet file in S3 locally

Asked: 2019-07-09 22:45:25

Tags: scala apache-spark amazon-s3

I'm building a Spark application and trying to run it locally before launching it on EMR or in a container. I can get a DataFrame to work fine when the Parquet file itself is local, but it refuses to read the Parquet file if it's in S3. I've tried setting all the variables I can think of that are suggested when reading from S3a; this is how I create my Spark session:

package util

import org.apache.spark.sql.SparkSession
import scala.io.Source

object SparkSessionFactory {

  def generateSession(sessionLocation: String): SparkSession = {
    val session = {
      sessionLocation match {
        case "local" =>
          SparkSession.builder().appName("LocalS3SparkProfiler").master("yarn").master("local[*]")
            .config("spark.driver.host", "localhost")
            .config("fs.s3a.enableServerSideEncryption", "true")
            .config("fs.s3a.serverSideEncryptionAlgorithm", "aws:kms")
            .getOrCreate()
      }
    }
    setHadoopConfigs(session, sessionLocation)
    session
  }

  private def setHadoopConfigs(session: SparkSession, sessionLocation: String): Unit = {
    session.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    session.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    sessionLocation match {
      case "local"=> {
        val userHome = System.getProperty("user.home")
        val aWSCredentialsLines = Source.fromFile(s"$userHome/.aws/credentials").getLines.toList

        val key = aWSCredentialsLines(1).substring(aWSCredentialsLines(1).lastIndexOf(" ")).trim
        val secret = aWSCredentialsLines(2).substring(aWSCredentialsLines(2).lastIndexOf(" ")).trim
        val s3Token = aWSCredentialsLines(3).substring(aWSCredentialsLines(3).lastIndexOf(" ")).trim

        session.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", key)
        session.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secret)
        session.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", s3Token)
        session.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
      }
    }
  }
}
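One fragile spot worth flagging in the code above: the credential parsing assumes the key, secret, and token sit on fixed line positions in ~/.aws/credentials, which breaks silently if the file contains comments or multiple profiles. A minimal sketch of a name-based lookup instead, assuming the standard AWS CLI key names (the CredentialsFile helper itself is hypothetical):

object CredentialsFile {
  // Finds a "name = value" line and returns the value, e.g.
  // lookup(lines, "aws_access_key_id"). Returns None if the key is absent,
  // so a malformed file fails visibly instead of yielding the wrong line.
  def lookup(lines: List[String], name: String): Option[String] =
    lines.collectFirst {
      case line if line.trim.startsWith(name) && line.contains("=") =>
        line.substring(line.indexOf('=') + 1).trim
    }
}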

Then when I try to read the DataFrame, I call:

val spark = SparkSessionFactory.generateSession("local")
val df = spark.read.parquet("s3a://my-bucket/thepath/myparquetfile")

The error thrown is as follows:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 366CFE11F21144F3; S3 Extended Request ID: eW4C6PQZ4uSJOPmYKoZ8qCwmK4PwL6eFPwef9e1KLA3kL2LsiCMctZ+ZLYVplZh927iNiSro7ko=), S3 Extended Request ID: eW4C6PQZ4uSJOPmYKoZ8qCwmK4PwL6eFPwef9e1KLA3kL2LsiCMctZ+ZLYVplZh927iNiSro7ko=
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1632)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4330)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4277)
  at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1265)

Everything I've read suggests that the credentials I need are the ones I'm providing. I've checked the values of key, secret, and s3Token and they look correct, as I use those credentials in another project that uses the normal AWS SDK with no problems.
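For reference, a sketch of that kind of check from inside the session itself; it reports whether each S3A setting actually landed in the Hadoop configuration the reader uses, without printing the secret values:

val conf = spark.sparkContext.hadoopConfiguration
println(s"provider          = ${conf.get("fs.s3a.aws.credentials.provider")}")
println(s"access key set    = ${Option(conf.get("fs.s3a.access.key")).exists(_.nonEmpty)}")
println(s"secret key set    = ${Option(conf.get("fs.s3a.secret.key")).exists(_.nonEmpty)}")
println(s"session token set = ${Option(conf.get("fs.s3a.session.token")).exists(_.nonEmpty)}")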

Any ideas as to the problem?

1 Answer:

Answer 0 (score: 0)

Debugging AWS auth failures is very hard, as neither AWS nor anyone implementing a client wants to log secrets to the console. Generally a "403" is as unhelpful as a "400".

  1. Have a look at Troubleshooting S3A.
  2. As well as straightforward authentication problems, SSE-KMS encryption of the file with an AWS KMS key your account doesn't have access to will cause an auth failure, and the error message doesn't call this out specifically.
  3. Try the AWS CLI with the same credentials to see whether they work (for example, aws s3 ls against the bucket). If they let you see the data, then it's inevitably some spark/s3a configuration problem; see the SDK sketch after this list for an in-process variant of the same check.
  4. Download the latest Hadoop release (ideally 3.2), install it, and configure it with your options in core-site.xml. Then use Cloudstore storediag to get a structured debug of the login process. If that doesn't work, Spark won't either.
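If installing a full Hadoop build for storediag is too heavy, a middle ground between points 2 and 3 is to call the AWS SDK v1 directly with the same parsed values, since that is the client the s3a connector uses underneath. A sketch, assuming the bucket and object key from the question and the SDK on the classpath:

import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicSessionCredentials}
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// key, secret, and s3Token are the values parsed from ~/.aws/credentials above.
val creds = new BasicSessionCredentials(key, secret, s3Token)
val s3 = AmazonS3ClientBuilder.standard()
  .withCredentials(new AWSStaticCredentialsProvider(creds))
  .withRegion("us-east-1") // assumption: substitute the bucket's actual region
  .build()

// getObjectMetadata is the same call that failed in the stack trace above.
val meta = s3.getObjectMetadata("my-bucket", "thepath/myparquetfile")
// A non-null KMS key id means the object is SSE-KMS encrypted (point 2).
println(meta.getSSEAwsKmsKeyId)

If this call 403s as well, the problem is the credentials or KMS permissions; if it succeeds, the fault is in the spark/s3a wiring.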