I tried using Spotify's spark-bigquery Spark package to save a DataFrame to a Google BigQuery table, but it failed. I am running it on Google Dataproc.
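My setup roughly follows the spotify/spark-bigquery README (the bucket name below is a placeholder), ending in the call shown after this block:

    import com.spotify.spark.bigquery._

    // The connector stages data through GCS, so it needs a billing project and
    // a staging bucket; the import also adds saveAsBigQueryTable to DataFrame.
    sqlContext.setBigQueryProjectId("my-project")
    sqlContext.setBigQueryGcsBucket("my-staging-bucket")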
df.saveAsBigQueryTable("my-project:my_dataset.my_table")
Here is the error log:
org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
  at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
  at com.spotify.spark.bigquery.BigQueryDataFrame.saveAsBigQueryTable(BigQueryDataFrame.scala:54)
  at com.spotify.spark.bigquery.BigQueryDataFrame.saveAsBigQueryTable(BigQueryDataFrame.scala:67)
  ... 84 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 107.0 failed 4 times
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
  ... 8 more
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
  ... 121 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
  ... 3 more
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
  ... 8 more
Answer (score: 1)
The part I find suspicious is the one that reads:

    Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
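An AbstractMethodError is a binary-compatibility failure: the JVM is invoking a method that is abstract in the class it loaded at runtime, but that the concrete subclass, compiled against an older version of that class, never implemented. Here, the running Spark's OutputWriterFactory declares getFileExtension, while the Avro writer factory pulled in by spark-bigquery was evidently compiled against a Spark version that did not. A contrived Scala sketch of the two sides of the mismatch (hypothetical classes, not actual Spark source):

    // Hypothetical sketch of a binary-compatibility break, not real Spark code.

    // The abstract class as it looked when the connector was compiled:
    abstract class WriterFactoryV1 {
      def newWriter(path: String): String
    }

    // The same class in the newer runtime jars, with an extra abstract method:
    abstract class WriterFactoryV2 {
      def newWriter(path: String): String
      def getFileExtension(path: String): String // added in a later release
    }

    // A factory compiled against V1 carries no bytecode for getFileExtension,
    // so when the newer runtime invokes it the JVM throws
    // java.lang.AbstractMethodError. This only happens across separately
    // compiled jars; compiling everything together would instead fail at
    // compile time with a missing-method error.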
Searching for this specific error leads to https://github.com/databricks/spark-avro/issues/208.
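Per that issue, spark-avro builds compiled against older Spark releases are binary-incompatible with Spark 2.2. If you build the job yourself and can override the transitive dependency (rather than relying on whatever spark-bigquery bundles), aligning spark-avro with your Spark version may also help; a hypothetical sbt line, assuming spark-avro 4.0.0 is the Spark 2.2-compatible release:

    // build.sbt: force a spark-avro build that matches the cluster's Spark version
    libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"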
Since spark-avro has problems with Spark 2.2, you may find that using Dataproc image version 1.1 (which ships Spark 2.0) lets the spark-bigquery library run correctly.
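For reference, a minimal sketch of pinning the image version when creating the cluster (cluster name and zone are placeholders):

    gcloud dataproc clusters create my-cluster \
        --image-version=1.1 \
        --zone=us-central1-a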