spark_write_csv() error when writing a file to S3 from sparklyr

Date: 2018-07-09 07:35:03

Tags: r apache-spark amazon-s3 apache-spark-sql sparklyr

I am trying to write a Spark DataFrame to S3 with the code below, but it fails with an error. The same code works fine when I save the file locally on my laptop.

library(sparklyr)

# Set AWS credentials for the S3 connector
Sys.setenv(AWS_ACCESS_KEY_ID = "xyz")
Sys.setenv(AWS_SECRET_ACCESS_KEY = "abc")

config <- spark_config()
config$sparklyr.defaultPackages <- c(
  "com.databricks:spark-csv_2.10:1.3.0",
  "com.amazonaws:aws-java-sdk-pom:1.10.34",
  "org.apache.hadoop:hadoop-aws:2.7.2")

spark_cluster <- spark_connect(master = "local", config = config, version = "2.1.0")

iris_1 <- copy_to(spark_cluster, iris, overwrite = TRUE)

iris_1 <- sdf_coalesce(iris_1, 1)

spark_write_csv(iris_1, path = "s3a://mypath..../Final_Data_MAF/iris.csv", header = TRUE, mode = "overwrite")
# This call does not work when I execute it to save the file to S3

spark_write_csv(iris_1, path = "C:\\Users\\yogesh\\Desktop\\Work Updated File", header = TRUE, mode = "overwrite")
# This call works fine and saves the file locally on my laptop
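For reference, the credentials can presumably also be handed to the S3A connector through spark_config() instead of relying only on the AWS_* environment variables. This is only a minimal sketch, assuming the standard hadoop-aws 2.7 property names (fs.s3a.access.key / fs.s3a.secret.key) and using the same placeholder values as above:

# Sketch: pass S3A credentials via Spark's spark.hadoop.* passthrough
config <- spark_config()
config$sparklyr.defaultPackages <- c(
  "com.amazonaws:aws-java-sdk-pom:1.10.34",
  "org.apache.hadoop:hadoop-aws:2.7.2")
config[["spark.hadoop.fs.s3a.access.key"]] <- "xyz"   # placeholder
config[["spark.hadoop.fs.s3a.secret.key"]] <- "abc"   # placeholder
sc <- spark_connect(master = "local", config = config, version = "2.1.0")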

This is the error I get:

Error: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
  at java.lang.Thread.run(Unknown Source)

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
  at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
  at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
  at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
  at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
  at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
  at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
  at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
  at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
  at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
  at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVRelation.scala:208)
  at org.apache.spark.sql.execution.datasources.csv.CSVOutputWriterFactory.newInstance(CSVRelation.scala:178)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org. ...
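The failing frame is java.lang.UnsatisfiedLinkError on org.apache.hadoop.io.nativeio.NativeIO$Windows.access0, which on Windows usually points at the Hadoop native binaries (winutils.exe / hadoop.dll) not being found when the S3A output stream creates its local temp file. A minimal sketch of the environment setup this would seem to require, assuming winutils built for Hadoop 2.7.x is unpacked under a hypothetical C:\hadoop directory:

# Sketch (assumption): point Hadoop at local native binaries before spark_connect()
# Assumes winutils.exe and hadoop.dll for Hadoop 2.7.x are placed in C:\hadoop\bin
Sys.setenv(HADOOP_HOME = "C:\\hadoop")
Sys.setenv(PATH = paste("C:\\hadoop\\bin", Sys.getenv("PATH"), sep = ";"))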

0 Answers:

No answers yet.