ArrayIndexOutOfBoundsException when writing data to Hive with Spark SQL

Time: 2018-03-30 07:11:19

Tags: apache-spark hadoop hive apache-spark-sql hiveql

I am trying to process a text file and write it into a Hive table. During the insert I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 4, 127.0.0.1, executor 0): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
    at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    ... 8 more

Here is my code:

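import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object maintenance {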
  case class event(Entity_Status_Code:String,Entity_Status_Description:String,Status:String,Event_Date:String,Event_Date2:String,Event_Date3:String,Event_Description:String)
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("maintenance").setMaster("local")
    conf.set("spark.debug.maxToStringFields", "10000000")
    val context = new SparkContext(conf)
    val sqlContext = new SQLContext(context)
    val hiveContext = new HiveContext(context)
    sqlContext.clearCache()
    //hiveContext.clearCache()
    //sqlContext.clearCache()

    import hiveContext.implicits._
    val rdd = context.textFile("file:///Users/hadoop/Downloads/sample.txt").map(line => line.split(" ")).map(x => event(x(0),x(1),x(2),x(3),x(4),x(5),x(6)))

    val personDF = rdd.toDF()
    personDF.show(10)
    personDF.registerTempTable("Maintenance")
    hiveContext.sql("insert into table default.maintenance select Entity_Status_Code,Entity_Status_Description,Status,Event_Date,Event_Date2,Event_Date3,Event_Description from Maintenance")


  }
}

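When I comment out all the hiveContext-related lines and run it locally (that is, with only personDF.show()), it works fine. But when I run it through spark-submit with hiveContext enabled, I get the error above.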

Here is my sample data:

4287053 06218896 N 19801222 19810901 19881222 M171 
4287053 06218896 N 19801222 19810901 19850211 M170 
4289713 06222552 Y 19810105 19810915 19930330 SM02 
4289713 06222552 Y 19810105 19810915 19930303 M285 
4289713 06222552 Y 19810105 19810915 19921208 RMPN 
4289713 06222552 Y 19810105 19810915 19921208 ASPN 
4289713 06222552 Y 19810105 19810915 19881116 ASPN 
4289713 06222552 Y 19810105 19810915 19881107 M171

1 Answer:

Answer 0 (score: 0)

Add -1 as the limit argument of the split (on the line where you compute val rdd = ...); this should fix your problem: line.split(" ", -1)

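Without the limit, trailing empty fields are dropped from the split result, so the array ends up shorter than expected and indexing into it throws the ArrayIndexOutOfBoundsException.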
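
To see what the limit argument changes, here is a minimal sketch in plain Scala (no Spark needed). The names Event, SplitDemo and parseLine are hypothetical and not part of the original code. Note that "ArrayIndexOutOfBoundsException: 1" means the array had only one element, which is what a blank line produces, and -1 alone does not change that. That would also be consistent with show(10) succeeding, since it only needs the first rows, while the insert processes every line. So the sketch also skips lines with fewer than seven fields; whether skipping is the right policy for your data is an assumption.

```scala
// Hypothetical stand-in for the `event` case class from the question.
case class Event(entityStatusCode: String, entityStatusDescription: String, status: String,
                 eventDate: String, eventDate2: String, eventDate3: String,
                 eventDescription: String)

object SplitDemo {
  // Splits on single spaces, keeping trailing empty fields (limit = -1).
  // Returns None for lines with fewer than 7 fields (for example blank lines)
  // instead of throwing ArrayIndexOutOfBoundsException.
  def parseLine(line: String): Option[Event] = {
    val f = line.split(" ", -1)
    if (f.length >= 7) Some(Event(f(0), f(1), f(2), f(3), f(4), f(5), f(6)))
    else None
  }

  def main(args: Array[String]): Unit = {
    println("a b  ".split(" ").length)     // 2: trailing empty strings are dropped
    println("a b  ".split(" ", -1).length) // 4: trailing empty strings are kept
    println("".split(" ", -1).length)      // 1: a blank line still yields a single field

    println(parseLine("4287053 06218896 N 19801222 19810901 19881222 M171 "))
    println(parseLine("")) // None
  }
}
```

In the job from the question, something like .flatMap(parseLine) in place of the second map would drop the malformed lines before toDF(); treat that as one possible approach rather than the only fix.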