My current Spark environment is Spark 2.4 built against Hadoop 2.7, but Hadoop 2.7 does not support SSE-KMS. According to Apache (HADOOP-13075), support was introduced in 2.8 and is fully supported from Hadoop 3.0 onwards. The official docs then say two configuration parameters should be added: fs.s3a.server-side-encryption-algorithm and fs.s3a.server-side-encryption.key.
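For what it's worth, the same two properties can also be handed to Spark through the usual spark.hadoop.* prefix instead of calling hadoopConfiguration() by hand; a minimal sketch, with a placeholder key ARN:

from pyspark.sql import SparkSession

# Spark copies every "spark.hadoop.*" entry into the underlying Hadoop Configuration,
# so these two settings end up as the S3A properties named above.
spark = (SparkSession.builder
         .appName("sse-kms-sketch")
         .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
         .config("spark.hadoop.fs.s3a.server-side-encryption.key",
                 "arn:aws:kms:ap-southeast-1:111111111111:key/EXAMPLE")  # placeholder ARN
         .getOrCreate())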
Following those docs, I added the packages org.apache.hadoop:hadoop-aws:3.1.1 and com.amazonaws:aws-java-sdk:1.9.5 to the spark-submit parameters, and added

spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", aws_sse_algorithm)
spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", aws_sse_key)

to the Spark config, where aws_sse_algorithm is 'SSE-KMS' and aws_sse_key is the key ARN provided by our administrator.
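A quick way to double-check that the two settings actually landed in the Hadoop configuration (just a sanity-check sketch, assuming the spark session created in the code below):

hconf = spark._jsc.hadoopConfiguration()
print(hconf.get("fs.s3a.server-side-encryption-algorithm"))  # expect: SSE-KMS
print(hconf.get("fs.s3a.server-side-encryption.key"))        # expect: the key ARN from our admin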
At the same time I added basically every related parameter I could find to the configuration, but I still get this error:
Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.
when I try to read the S3 object in Spark:
df = spark.read.json('s3a://XXXXXXX/XXXXX/XXXXXXXX/result.json')
2019-08-09 14:54:09,525 ERROR executor.Executor: Exception in task 0.0 in stage 4.0 (TID 4)
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 7C1C371AE02F476A, AWS Error Code: InvalidArgument,
AWS Error Message: Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4., S3 Extended Request ID: hlCH96//G18Bs47fGJwxt+Ccpdf0YNOadt9bUPYei2InkkUeKCslq/4m353RnQEhopBfvjVIcx0=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
.......
My full code:
import datetime, time
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, IntegerType, DoubleType, ArrayType, StructType, StructField, MapType
import boto3
import json
import pytz
import configparser
import argparse
from dateutil.parser import parse
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:3.1.1,org.apache.hadoop:hadoop-common:3.1.1,org.apache.hadoop:hadoop-auth:3.1.1," \
    "com.amazonaws:aws-java-sdk:1.9.5 " \
    "pyspark-shell"
spark = SparkSession.builder.appName("test").getOrCreate()
aws_sse_algorithm = 'SSE-KMS'
aws_sse_key = 'arn:aws:kms:ap-southeast-1:XXXXXXX:key/XXXXXX'
aws_access_id = 'XXXXX'
aws_access_key = 'XXXXX'
aws_region = 'ap-southeast-1'
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_id) spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_access_key) spark._jsc.hadoopConfiguration().set("fs.s3a.fast.upload", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider") spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3."+aws_region+".amazonaws.com")
spark._jsc.hadoopConfiguration().set("fs.s3a.sse.enabled", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.enableServerSideEncryption", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", aws_sse_algorithm) spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", aws_sse_key) spark._jsc.hadoopConfiguration().set("fs.s3a.sse.kms.keyId", aws_sse_key)
df = spark.read.json('s3a://XXXXXXX/XXXXX/XXXXXXXX/result.json')
I am not sure whether this is related to the Hadoop jars on my local Spark classpath still being at version 2.7.3, even though I added the 3.1.1 jars to the --packages section for Spark.
Answer 0 (score: 1)
If you have to set JVM options to get v4 signing to work, then you are still picking up the Hadoop 2.7 S3A implementation.
Sadly, until you have a consistent set of JARs, all you are going to do is move the stack traces around. Get those dependencies right first.
That means: upgrade to the Hadoop 2.8+ artifacts. Completely.
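If in doubt about what is actually on the classpath, Hadoop's VersionInfo will tell you from PySpark (a small sketch; if it prints 2.7.x, that is the S3A code the job is running, regardless of what was listed in --packages):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# VersionInfo reports the Hadoop version the driver JVM actually loaded.
print(spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion())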
Answer 1 (score: 0)
I figured it out. The correct configuration to enable Amazon S3 Signature Version 4 is:
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
not
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")