Spark DataFrames are created successfully but cannot be written to local disk

Posted: 2017-08-30 14:47:38

Tags: apache-spark intellij-idea spark-dataframe

I am executing Spark Scala code in the IntelliJ IDE on a Microsoft Windows platform.

I have four Spark DataFrames of roughly 30,000 records each, and as part of my requirement I am trying to take one column from each of them.

I use Spark SQL functions to do this and it executes successfully. When I run DF.show() or DF.count(), I can see the results on screen, but when I try to write a DataFrame to my local disk (a Windows directory), the job is aborted with the following error:


Exception in thread "main" org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
    at main.src.countFeatures2$.countFeature$1(countFeatures2.scala:118)
    at main.src.countFeatures2$.getFeatureAsString$1(countFeatures2.scala:32)
    at main.src.countFeatures2$.main(countFeatures2.scala:40)
    at main.src.countFeatures2.main(countFeatures2.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 2636, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 d:\Test_Output_File2\_temporary\0\_temporary\attempt_20170830194047_0031_m_000000_0\part-00000-85c32c55-e12d-4433-979d-ccecb2fcd341.csv
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
    at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
    at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVRelation.scala:208)
    at org.apache.spark.sql.execution.datasources.csv.CSVOutputWriterFactory.newInstance(CSVRelation.scala:178)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
    ... 28 more
Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 d:\Test_Output_File2\_temporary\0\_temporary\attempt_20170830194047_0031_m_000000_0\part-00000-85c32c55-e12d-4433-979d-ccecb2fcd341.csv
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
    at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
    at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVRelation.scala:208)
    at org.apache.spark.sql.execution.datasources.csv.CSVOutputWriterFactory.newInstance(CSVRelation.scala:178)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Picked up _JAVA_OPTIONS: -Xmx512M


Process finished with exit code 1

I cannot figure out where it is going wrong. Can anyone explain how to overcome this?

UPDATE: Please note that I was able to write the same files until yesterday, and nothing has changed in my system's or the IDE's configuration since then. So I do not understand why it ran until yesterday and no longer runs now.

There is a similar post at this link: (null) entry in command string exception in saveAsTextFile() on Pyspark, but they are using PySpark in a Jupyter notebook, whereas my issue is in the IntelliJ IDE.

Super-simplified code that writes the output file to local disk:

// Join the four DataFrames on their primary key and take one column from each
val Test_Output = spark.sql("select A.Col1, A.Col2, B.Col2, C.Col2, D.Col2 from A, B, C, D where A.primaryKey = B.primaryKey and B.primaryKey = C.primaryKey and C.primaryKey = D.primaryKey and D.primaryKey = A.primaryKey")

// Collapse to a single partition and write one CSV file (with header) to the local Windows directory
val Test_Output_File = Test_Output.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("nullValue", "0").save("D:/Test_Output_File")

2 Answers:

Answer 0 (score: 2)

This looks file-system related: java.io.IOException: (null) entry in command string: null chmod 0644

Since you are running on Windows, have you set HADOOP_HOME to a folder that contains winutils.exe?
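For reference, a minimal sketch of how this is usually wired up on Windows; the C:\hadoop location is only an assumption, so point it at whatever folder actually holds bin\winutils.exe, and set it before the SparkSession is created:

    import org.apache.spark.sql.SparkSession

    object LocalCsvWrite {
      def main(args: Array[String]): Unit = {
        // Hypothetical winutils location; the folder must contain bin\winutils.exe.
        // Setting hadoop.home.dir here has the same effect as a HADOOP_HOME environment variable.
        System.setProperty("hadoop.home.dir", "C:\\hadoop")

        val spark = SparkSession.builder()
          .appName("LocalCsvWrite")
          .master("local[*]")
          .getOrCreate()

        // ... build the DataFrames and write them out as in the question ...
      }
    }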

Answer 1 (score: 0)

Finally I fixed it myself. I used the .persist() method while creating the DataFrames, and that let me write the output file without any error, although I do not understand the logic behind it.
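A minimal sketch of the idea, applied to the question's own code (here the joined result is persisted; the same call can equally be added to each of the four input DataFrames when they are created, and the storage level shown is an assumption):

    import org.apache.spark.storage.StorageLevel

    // Same four-way join as in the question, but the result is persisted before it is written
    val Test_Output = spark.sql(
      "select A.Col1, A.Col2, B.Col2, C.Col2, D.Col2 from A, B, C, D " +
      "where A.primaryKey = B.primaryKey and B.primaryKey = C.primaryKey " +
      "and C.primaryKey = D.primaryKey and D.primaryKey = A.primaryKey")
      .persist(StorageLevel.MEMORY_AND_DISK)   // materialize the joined result once, reuse for show/count/write

    // Single-partition CSV write, exactly as before
    Test_Output.coalesce(1)
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("nullValue", "0")
      .save("D:/Test_Output_File")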

Thanks for your valuable inputs on this.