I am trying to use sparklyr to read data from S3 into a local Spark session and to write data back to S3.
library(sparklyr)
library(nycflights13)  # assumed source of the `flights` dataset copied below

Sys.setenv(AWS_ACCESS_KEY_ID = "xxxx")
Sys.setenv(AWS_SECRET_ACCESS_KEY = "xxxx")

config <- spark_config()
config$sparklyr.defaultPackages <- c(
  "com.databricks:spark-csv_2.11:1.3.0",
  "com.amazonaws:aws-java-sdk-pom:1.10.34",
  "org.apache.hadoop:hadoop-aws:2.7.2")

sc <- spark_connect(master = "local", config = config)

# Reading data from S3 works fine
table_1 <- spark_read_csv(sc, name = "Iris_data", path = "s3a://bucket_path/iris.csv")

# Writing data to S3 throws an error
flights_tbl <- copy_to(sc, flights, "flights")
spark_write_csv(flights_tbl, path = "s3a://bucket_path/data", mode = "overwrite")
Reading data from S3 works fine, but writing data to S3 with spark_write_csv throws an error.
This is the error I get when writing to S3:
Error: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at sparklyr.Invoke.invoke(invoke.scala:137)
    at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
    at sparklyr.StreamHandler.read(stream.scala:66)
    at sparklyr.BackendHandler.channelRead0(handler.scala:51)
    at sparklyr.BackendHandler.channelRead0(handler.scala:4)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
    at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
    at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
    at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:149)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:77)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
    at org.apache.spark.sql.execution. ...
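From the root-cause frame (java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0), the write seems to fail while S3A buffers the output to a local temp file on Windows, which requires the Hadoop native binaries (winutils.exe and hadoop.dll). Below is a minimal sketch of what I understand the usual workaround to be, assuming winutils built for Hadoop 2.7.x has been unpacked under the hypothetical path C:\hadoop\bin; the path is an assumption, not something from my actual setup:

# Sketch only: point Hadoop at winutils.exe / hadoop.dll before spark_connect().
# "C:\\hadoop" is a hypothetical install location for the Hadoop 2.7.x winutils binaries.
Sys.setenv(HADOOP_HOME = "C:\\hadoop")
Sys.setenv(PATH = paste("C:\\hadoop\\bin", Sys.getenv("PATH"), sep = ";"))

sc <- spark_connect(master = "local", config = config)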