I am trying to process a text file and write it to a Hive table. During the insert I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 4, 127.0.0.1, executor 0): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
... 8 more
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object maintenance {
  case class event(Entity_Status_Code:String,Entity_Status_Description:String,Status:String,Event_Date:String,Event_Date2:String,Event_Date3:String,Event_Description:String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("maintenance").setMaster("local")
    conf.set("spark.debug.maxToStringFields", "10000000")
    val context = new SparkContext(conf)
    val sqlContext = new SQLContext(context)
    val hiveContext = new HiveContext(context)
    sqlContext.clearCache()
    //hiveContext.clearCache()
    //sqlContext.clearCache()
    import hiveContext.implicits._
    val rdd = context.textFile("file:///Users/hadoop/Downloads/sample.txt").map(line => line.split(" ")).map(x => event(x(0),x(1),x(2),x(3),x(4),x(5),x(6)))
    val personDF = rdd.toDF()
    personDF.show(10)
    personDF.registerTempTable("Maintenance")
    hiveContext.sql("insert into table default.maintenance select Entity_Status_Code,Entity_Status_Description,Status,Event_Date,Event_Date2,Event_Date3,Event_Description from Maintenance")
  }
}
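When I comment out all the hiveContext-related lines and run it locally (that is, with only personDF.show()), it works fine. But when I run it with spark-submit and hiveContext enabled, I get the error above.
Here is my sample data: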
4287053 06218896 N 19801222 19810901 19881222 M171
4287053 06218896 N 19801222 19810901 19850211 M170
4289713 06222552 Y 19810105 19810915 19930330 SM02
4289713 06222552 Y 19810105 19810915 19930303 M285
4289713 06222552 Y 19810105 19810915 19921208 RMPN
4289713 06222552 Y 19810105 19810915 19921208 ASPN
4289713 06222552 Y 19810105 19810915 19881116 ASPN
4289713 06222552 Y 19810105 19810915 19881107 M171
Answer 0: (Score: 0)
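Pass -1 as the limit argument to split, on the line where you compute val rdd = ...; this should fix your problem: line.split(" ", -1)
Without a limit, split drops trailing empty fields, so a line whose last columns are empty yields fewer than seven elements, and indexing into the result throws the ArrayIndexOutOfBoundsException.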
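To make the difference concrete, here is a minimal sketch in plain Scala (no Spark needed) that compares the two forms of split. It also guards against lines that are still too short: the exception reports index 1, which means split returned a single element, and that happens for a blank line or a line containing no single-space delimiter, where -1 alone does not help. That would also be consistent with personDF.show(10) succeeding, since show(10) only parses enough lines to produce ten rows. The names SplitDemo, Event and parseLine are illustrative and do not come from the question.

```scala
// Sketch only: SplitDemo, Event and parseLine are hypothetical names.
object SplitDemo {
  // Mirrors the seven string columns of the question's `event` case class.
  case class Event(entityStatusCode: String, entityStatusDescription: String,
                   status: String, eventDate: String, eventDate2: String,
                   eventDate3: String, eventDescription: String)

  // Returns None instead of throwing when a line does not have exactly seven fields.
  def parseLine(line: String): Option[Event] =
    line.split(" ", -1) match {
      case Array(a, b, c, d, e, f, g) => Some(Event(a, b, c, d, e, f, g))
      case _                          => None
    }

  def main(args: Array[String]): Unit = {
    // A line whose seventh field is empty (note the trailing space).
    val trailingEmpty = "4287053 06218896 N 19801222 19810901 19881222 "
    println(trailingEmpty.split(" ").length)     // 6: trailing empty field dropped
    println(trailingEmpty.split(" ", -1).length) // 7: trailing empty field kept
    println("".split(" ", -1).length)            // 1: a blank line is still too short
    println(parseLine("4287053 06218896 N 19801222 19810901 19881222 M171").isDefined)
    println(parseLine("").isDefined)
  }
}
```

In the Spark job, the same helper could replace the two map calls, for example context.textFile(path).flatMap(line => parseLine(line).toList), so that malformed lines are skipped rather than failing the task. Whether silently skipping them is acceptable depends on your data; counting or logging the rejected lines may be the safer choice.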