I am running into a problem when saving a DataFrame to a Hive table with the following API call:
df.write.mode(SaveMode.Append).format("parquet").partitionBy("ord_deal_year", "ord_deal_month", "ord_deal_day").insertInto(tableName)
My DataFrame has roughly 48 columns, whereas the Hive table has 90 columns. When I try to save the DataFrame, I get the following error:
12:56:11 Executor task launch worker-0 ERROR Executor:96 Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.ArrayIndexOutOfBoundsException: 51
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.genericGet(rows.scala:253)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:34)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:35)
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.isNullAt(rows.scala:247)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:107)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:104)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:104)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
12:56:11 task-result-getter-3 WARN TaskSetManager:71 Lost task 0.0 in stage 3.0 (TID 3, localhost): java.lang.ArrayIndexOutOfBoundsException: 51
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.genericGet(rows.scala:253)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:34)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:35)
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.isNullAt(rows.scala:247)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:107)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:104)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:104)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
12:56:11 task-result-getter-3 ERROR TaskSetManager:75 Task 0 in stage 3.0 failed 1 times; aborting job
I tried to add the missing columns with the following snippet:
import org.apache.spark.sql.functions.lit

// `columns` holds the (columnName, dataType) pairs expected by the Hive table
val columnsAdded = columns.foldLeft(df) { case (d, c) =>
  if (d.columns.contains(c._1)) {
    // column already exists; keep the DataFrame unchanged
    d
  } else {
    // column is missing, so add it as a typed null
    d.withColumn(c._1, lit(null).cast(c._2))
  }
}
But the same problem persists.
I also checked the question Error while trying to save the data to Hive tables from Dataframe and its solution, which attributed the error to a DataFrame schema that did not match the Hive table, so I ran the following check:
newDF.schema.map { i =>
  s"Column ${i.name},${i.dataType}" +
    s" Column exists in hive ${hiveSchema.get(i.name).isDefined}" +
    s" Hive Table has the correct datatype ${i.dataType == hiveSchema(i.name)}"
}.foreach(i => println(i))
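For reference, a minimal sketch of how hiveSchema could be built for this check, assuming it is a Map from column name to Spark DataType derived from the table as Spark sees it (the names here are illustrative, not from the original post):

import org.apache.spark.sql.types.DataType

// Assumed helper: build a name -> DataType map from the Hive table's schema.
val hiveSchema: Map[String, DataType] =
  sqlContext.table(tableName).schema
    .map(field => field.name -> field.dataType)
    .toMap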
Has anyone seen this issue, or does anyone have suggestions on how to fix it?
Answer 0 (score: 1)
You don't have to use partitionBy when using insertInto; the table's partition columns are used to partition the data in Hive.
By the way, DataFrame offers a printSchema method that prints the schema out of the box.
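A minimal sketch of what the corrected write could look like, assuming the Hive table is already partitioned by ord_deal_year, ord_deal_month and ord_deal_day and that the DataFrame contains those columns:

// Inspect the DataFrame schema before writing (helps spot column/order mismatches).
df.printSchema()

// insertInto relies on the partitioning defined on the Hive table itself,
// so partitionBy (and format) can be dropped from the call.
df.write
  .mode(SaveMode.Append)
  .insertInto(tableName)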
Answer 1 (score: 1)
I would explicitly select all the additional columns needed and fill in the missing attributes.
Another thing to watch out for is that the columns must be in the correct order. Spark will happily write Parquet files that fit the schema, but it ignores the column names you use. So if Hive has a: string, b: string and your Spark code produces a DataFrame ordered as "b, a", it will write without errors but the columns will end up in the wrong order.
So, combining both suggestions, I would add a guard clause that selects exactly the columns Hive has in its metadata, in exactly that order, right before the write / insertInto. A sketch of this is shown below.
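A rough sketch of such a guard clause, assuming a Spark 1.x sqlContext and that tableName refers to the same Hive table (identifiers are illustrative, not from the original post):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, lit}

// Columns of the Hive table, in the exact order Hive stores them in its metadata.
val hiveFields = sqlContext.table(tableName).schema.fields

// Add any column the DataFrame is missing as a typed null, then
// select all columns in the Hive table's order right before insertInto.
val aligned = hiveFields
  .foldLeft(df) { (d, f) =>
    if (d.columns.contains(f.name)) d
    else d.withColumn(f.name, lit(null).cast(f.dataType))
  }
  .select(hiveFields.map(f => col(f.name)): _*)

aligned.write.mode(SaveMode.Append).insertInto(tableName)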