在尝试将数据帧写入S3时,出现了空指针异常以下错误。有时候,这项工作进展顺利,有时却失败了。
我正在使用EMR 5.20和spark 2.4.0
创建火花会话
val spark = SparkSession.builder
.config("spark.sql.parquet.binaryAsString", "true")
.config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
.config("spark.sql.parquet.filterPushdown", "true")
.config("spark.sql.parquet.fs.optimized.committer.optimization-enabled","true")
.getOrCreate()
spark.sql("myQuery").write.partitionBy("partitionColumn").mode(SaveMode.Overwrite).option("inferSchema","false").parquet("s3a://...filePath")
任何人都可以帮助解决这个难题。预先感谢
java.lang.NullPointerException
at com.amazon.ws.emr.hadoop.fs.s3.lite.S3Errors.isHttp200WithErrorCode(S3Errors.java:57)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:100)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:127)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:364)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
at org.apache.spark.internal.io.FileCommitProtocol.deleteWithJob(FileCommitProtocol.scala:124)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:223)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:122)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
... 55 elided
答案 0 :(得分:1)
看起来像AWS代码中的错误。这是封闭源代码-您必须与他们合作。
我确实看到一个暗示,这是试图解析错误响应的代码中的错误。也许某些事情失败了,但是客户端上传递该错误响应的代码却有问题。这不是不寻常的是-故障处理很少获得足够的测试覆盖率
答案 1 :(得分:1)
您正在使用SaveMode.Overwrite
,错误行com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:127)
表示覆盖的删除操作过程中出现问题。
我将检查并确保您的EMR EC2实例配置文件的IAM策略中的S3权限允许对写入Parquet的调用中的文件路径进行s3:DeleteObject
操作。它应该看起来像这样:
{
"Sid": "AllowWriteAccess",
"Action": [
"s3:DeleteObject",
"s3:Get*",
"s3:List*",
"s3:PutObject"
],
"Effect": "Allow",
"Resource": [
"<arn_for_your_filepath>/*"
]
}
在两次作业之间,您在调用Parquet时是否使用了不同的文件路径?如果是这样的话,那将解释间歇性的工作失败