saveAsBigQueryTable does not work

Date: 2017-11-29 20:23:23

Tags: apache-spark google-bigquery spark-dataframe spotify google-cloud-dataproc

I am trying to save a DataFrame to a Google BigQuery table using Spotify's spark-bigquery Spark package, but it fails. I am running it on Google Cloud Dataproc.

df.saveAsBigQueryTable("my-project:my_dataset.my_table") 
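
For context, the call is wired up along the lines of the library's documented usage. Below is a minimal sketch; the import follows spark-bigquery's README, and the setter calls and placeholder values are illustrative assumptions rather than copied from my actual job:

    import com.spotify.spark.bigquery._

    // Staging/billing configuration (placeholder values; on Dataproc the
    // default service-account credentials are typically picked up automatically)
    sqlContext.setBigQueryProjectId("my-project")
    sqlContext.setBigQueryGcsBucket("my-staging-bucket")

    // The failing write: spark-bigquery stages the DataFrame as Avro on GCS,
    // then issues a BigQuery load job from there
    df.saveAsBigQueryTable("my-project:my_dataset.my_table")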

Here is the error log:

org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
  at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
  at com.spotify.spark.bigquery.BigQueryDataFrame.saveAsBigQueryTable(BigQueryDataFrame.scala:54)
  at com.spotify.spark.bigquery.BigQueryDataFrame.saveAsBigQueryTable(BigQueryDataFrame.scala:67)
  ... 84 elided

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 107.0 failed 4 times
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
  ... 8 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
  ... 121 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
  ... 3 more
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
  ... 8 more

1 Answer:

Answer 0 (Score: 1)

The part that looks suspicious to me is the one that reads: Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String; at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299). An AbstractMethodError at runtime typically means a library on the classpath was compiled against a different version of Spark than the one it is running on.

Searching for this specific error leads to: https://github.com/databricks/spark-avro/issues/208
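
If you want to double-check which spark-avro build your cluster actually loaded, a quick probe from spark-shell can help. This is a sketch; it assumes spark-avro's usual com.databricks.spark.avro package layout:

    // Print the location of the jar that provides spark-avro's DefaultSource
    println(classOf[com.databricks.spark.avro.DefaultSource]
      .getProtectionDomain.getCodeSource.getLocation)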

Since spark-avro has problems with Spark 2.2, you may find that the spark-bigquery library works correctly if you use Dataproc image version 1.1 (which runs Spark 2.0).
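
If you go that route, the image version can be pinned when the cluster is created; a minimal sketch, where the cluster name is a placeholder and any other flags you normally pass stay the same:

    gcloud dataproc clusters create my-cluster --image-version 1.1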