Running spark-sql from spark-shell throws an exception [Caused by: java.lang.IllegalArgumentException: Field "id" does not exist.]

Date: 2018-11-01 11:44:14

Tags: apache-spark hive apache-spark-sql left-join

First, a dataset is created in spark-shell with the following spark-sql command:

spark.sql("select id ,a.userid,regexp_replace(b.tradeno,',','|') as TradeNo
,Amount ,TradeType ,TxTypeId
,regexp_replace(title,',','|') as title
,status ,tradetime ,TradeStatus
,regexp_replace(otherside,',','') as otherside
from
(
    select userid 
    from tableA
    where daykey='2018-10-30'
    group by userid
) a 
left join tableb b
on a.userid=b.userid 
where b.userid is not null")

The result is:

dataset: org.apache.spark.sql.DataFrame = [id: bigint, userid: int ... 9 more fields]
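Since the schema reported by spark-shell already lists id, the field can be confirmed on the logical plan before writing. A minimal check, using the dataset value defined above and only the standard DataFrame API:

// Print the logical schema of the joined DataFrame; "id" is listed here,
// even though the ORC reader later fails to resolve it on the underlying files.
dataset.printSchema()
dataset.schema.fieldNames.foreach(println)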

Then the dataset is exported to CSV with the following command:

dataset.coalesce(40).write.option("delimiter", ",").option("charset", "utf-8").csv("/binlog_test/mycsv.excel")

When the Spark job runs, the following error occurs:

Driver stacktrace:


org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1430)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1417)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:797)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:797)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:797)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1645)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1600)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1589)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1943)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1963)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
        ... 69 more
Caused by: java.lang.IllegalArgumentException: Field "id" does not exist.
        at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:290)
        at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:290)
        at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
        at scala.collection.AbstractMap.getOrElse(Map.scala:59)
        at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:289)
        at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$6.apply(OrcFileFormat.scala:308)
        at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$6.apply(OrcFileFormat.scala:308)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at org.apache.spark.sql.types.StructType.foreach(StructType.scala:96)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at org.apache.spark.sql.types.StructType.map(StructType.scala:96)
        at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcFileFormat.scala:308)
        at org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:140)
        at org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:129)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

However, when I perform the join directly in Hive, create a new table from the join result, and then export that table with the spark-sql command shown earlier, everything works fine.
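A minimal sketch of that workaround, assuming a hypothetical intermediate table name tableA_tableB_joined (the join SQL and the export call are the same as above; the CTAS step can equally be run from the Hive CLI instead of spark.sql):

// Step 1: materialize the join result into a new table.
// The table name "tableA_tableB_joined" is only an illustrative placeholder.
spark.sql("""
  create table tableA_tableB_joined as
  select id, a.userid, regexp_replace(b.tradeno, ',', '|') as TradeNo,
         Amount, TradeType, TxTypeId,
         regexp_replace(title, ',', '|') as title,
         status, tradetime, TradeStatus,
         regexp_replace(otherside, ',', '') as otherside
  from (select userid from tableA where daykey = '2018-10-30' group by userid) a
  left join tableb b on a.userid = b.userid
  where b.userid is not null
""")

// Step 2: read the materialized table and export it to CSV with the same
// write options and output path as the original command.
spark.table("tableA_tableB_joined")
  .coalesce(40)
  .write.option("delimiter", ",").option("charset", "utf-8")
  .csv("/binlog_test/mycsv.excel")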

0 Answers:

There are no answers yet.