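I am building a Spark application and am trying to run it locally before launching it on EMR or in a container. I can get the DataFrame working fine when the parquet file itself is a local file, but Spark refuses to read the parquet file if it is in S3. I have tried setting every variable I can think of that is suggested for reading from S3a. This is how I create my Spark session: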
package util

import org.apache.spark.sql.SparkSession
import scala.io.Source

object SparkSessionFactory {

  def generateSession(sessionLocation: String): SparkSession = {
    val session = {
      sessionLocation match {
        case "local" =>
          SparkSession.builder().appName("LocalS3SparkProfiler").master("yarn").master("local[*]")
            .config("spark.driver.host", "localhost")
            .config("fs.s3a.enableServerSideEncryption", "true")
            .config("fs.s3a.serverSideEncryptionAlgorithm", "aws:kms")
            .getOrCreate()
      }
    }
    setHadoopConfigs(session, sessionLocation)
    session
  }

  private def setHadoopConfigs(session: SparkSession, sessionLocation: String) = {
    session.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    session.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    sessionLocation match {
      case "local" => {
        val userHome = System.getProperty("user.home")
        val aWSCredentialsLines = Source.fromFile(s"$userHome/.aws/credentials").getLines.toList
        val key = aWSCredentialsLines(1).substring(aWSCredentialsLines(1).lastIndexOf(" ")).trim
        val secret = aWSCredentialsLines(2).substring(aWSCredentialsLines(2).lastIndexOf(" ")).trim
        val s3Token = aWSCredentialsLines(3).substring(aWSCredentialsLines(3).lastIndexOf(" ")).trim
        session.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", key)
        session.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secret)
        session.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", s3Token)
        session.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
      }
    }
  }
}
Then, when I try to read the DataFrame, I call:
val spark = SparkSessionFactory.generateSession("local")
val df = spark.read.parquet("s3a://my-bucket/thepath/myparquetfile")
The error thrown is as follows:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 366CFE11F21144F3; S3 Extended Request ID: eW4C6PQZ4uSJOPmYKoZ8qCwmK4PwL6eFPwef9e1KLA3kL2LsiCMctZ+ZLYVplZh927iNiSro7ko=), S3 Extended Request ID: eW4C6PQZ4uSJOPmYKoZ8qCwmK4PwL6eFPwef9e1KLA3kL2LsiCMctZ+ZLYVplZh927iNiSro7ko=
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1632)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4330)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4277)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1265)
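Everything I have read suggests that the credentials I am supplying are the ones I need. I checked the values of key, secret, and s3Token, and they look correct, since I use the same credentials without problems in another project that uses the normal AWS SDK.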
Any ideas on what the problem might be?
Answer 0 (score: 0)
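Debugging AWS auth failures is hard, because neither AWS nor anyone implementing a client wants to log secrets to the console. Generally, a "403" is just as unhelpful as a "400"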
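One thing that can be checked without printing any secrets is whether the values pulled out of ~/.aws/credentials are really the ones intended. The code in the question picks them by fixed line position, which may select the wrong lines if the file contains more than one profile or lists the keys in a different order. Below is a minimal sketch, assuming the usual INI-style layout with the key names aws_access_key_id, aws_secret_access_key and aws_session_token; the object name AwsCredentialsCheck and its helpers are hypothetical and not part of Spark, Hadoop or the AWS SDK. It looks values up by name within a profile and reports only whether each one is present and how long it is.

```scala
import java.nio.file.{Files, Paths}
import scala.io.Source

// Hypothetical helper: not part of Spark, Hadoop or the AWS SDK.
object AwsCredentialsCheck {

  // Parse INI-style lines into profile name -> (key -> value).
  def parseProfiles(lines: Seq[String]): Map[String, Map[String, String]] = {
    val section = """^\[(.+)\]$""".r
    val entry = """^([^=\s]+)\s*=\s*(.*)$""".r
    var current = ""
    var result = Map.empty[String, Map[String, String]]
    for (raw <- lines) {
      val line = raw.trim
      line match {
        case section(name) =>
          current = name.trim
          result = result.updated(current, result.getOrElse(current, Map.empty[String, String]))
        case entry(k, v) if current.nonEmpty && !line.startsWith("#") && !line.startsWith(";") =>
          result = result.updated(current, result(current) + (k -> v.trim))
        case _ => // blank lines, comments, anything unrecognised
      }
    }
    result
  }

  // Describe a secret without revealing it: only its length is reported.
  def describe(name: String, value: Option[String]): String =
    value match {
      case Some(v) if v.nonEmpty => s"$name: present (${v.length} chars)"
      case _ => s"$name: MISSING"
    }

  // Key names assumed from the standard AWS credentials file format.
  val requiredKeys: Seq[String] =
    Seq("aws_access_key_id", "aws_secret_access_key", "aws_session_token")

  def report(profile: Map[String, String]): Seq[String] =
    requiredKeys.map(k => describe(k, profile.get(k)))

  def main(args: Array[String]): Unit = {
    val path = Paths.get(System.getProperty("user.home"), ".aws", "credentials")
    if (Files.exists(path)) {
      val source = Source.fromFile(path.toFile)
      try {
        val profiles = parseProfiles(source.getLines().toList)
        // "default" is an assumption; use whichever profile holds the temporary credentials.
        report(profiles.getOrElse("default", Map.empty[String, String])).foreach(println)
      } finally source.close()
    } else {
      println(s"No credentials file found at $path")
    }
  }
}
```

If all three entries are reported as present for the expected profile, the values can be passed to fs.s3a.access.key, fs.s3a.secret.key and fs.s3a.session.token exactly as in the question. A missing or unexpectedly short entry would suggest that the positional parsing is reading the wrong lines.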