Using sparklyr

Time: 2018-07-05 12:34:11

Tags: r apache-spark amazon-s3 rstudio sparklyr

I am trying to use sparklyr to read data from a Spark environment and write data to S3.

library(sparklyr)
library(nycflights13)   # `flights` is assumed to come from the nycflights13 package

Sys.setenv(AWS_ACCESS_KEY_ID = "xxxx")
Sys.setenv(AWS_SECRET_ACCESS_KEY = "xxxx")

config <- spark_config()
config$sparklyr.defaultPackages <- c(
  "com.databricks:spark-csv_2.11:1.3.0",
  "com.amazonaws:aws-java-sdk-pom:1.10.34",
  "org.apache.hadoop:hadoop-aws:2.7.2")

sc <- spark_connect(master = "local", config = config)

# Reading data from S3 works fine
table_1 <- spark_read_csv(sc, name = "Iris_data", path = "s3a://bucket_path/iris.csv")

# Writing data to S3 throws an error
flights_tbl <- copy_to(sc, flights, "flights")
spark_write_csv(flights_tbl, path = "s3a://bucket_path/data", mode = "overwrite")
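
As a side note, I understand the same credentials can also be set directly on the Hadoop configuration of the running session through sparklyr's invoke API; this is only a sketch of that pattern (it assumes the s3a property names from hadoop-aws and has not been verified to change the behaviour here):

    # Sketch: pass the S3A credentials through the session's Hadoop configuration
    # instead of (or in addition to) the AWS_* environment variables.
    hconf <- invoke(spark_context(sc), "hadoopConfiguration")
    invoke(hconf, "set", "fs.s3a.access.key", "xxxx")
    invoke(hconf, "set", "fs.s3a.secret.key", "xxxx")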

Reading data from S3 works fine, but writing the data to S3 with spark_write_csv throws an error.

This is the error I get when writing data to S3:


Error: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
  at java.lang.reflect.Method.invoke(Unknown Source)
  at sparklyr.Invoke.invoke(invoke.scala:137)
  at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
  at sparklyr.StreamHandler.read(stream.scala:66)
  at sparklyr.BackendHandler.channelRead0(handler.scala:51)
  at sparklyr.BackendHandler.channelRead0(handler.scala:4)
  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
  at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
  at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
  at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
  at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
  at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
  at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
  at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
  at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
  at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
  at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
  at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
  at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
  at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
  at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
  at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
  at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:149)
  at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:77)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
  at org.apache.spark.sql.execution.
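
The last "Caused by" is a java.lang.UnsatisfiedLinkError on NativeIO$Windows.access0, and the frames just above it (LocalDirAllocator.createTmpFileForWrite, S3AOutputStream) show the s3a connector buffering the CSV to a local temp file before uploading, so the failure appears to be on the local Windows side rather than in S3 itself. Below is only a sketch of what I am considering, assuming the Hadoop native binaries (winutils.exe / hadoop.dll) are available locally; the paths are placeholders and this has not been verified to fix the error:

    # Sketch only: point Hadoop at a local directory containing bin/winutils.exe
    # (and hadoop.dll), give S3A a writable local buffer directory, then reconnect.
    # Both paths are placeholders.
    Sys.setenv(HADOOP_HOME = "C:/hadoop")
    config$spark.hadoop.fs.s3a.buffer.dir <- "C:/tmp/s3a"
    sc <- spark_connect(master = "local", config = config)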

0 Answers:

There are no answers yet.