Error when displaying data from a Hive table in a Spark program: DoubleWritable cannot be cast to Text

Date: 2018-08-14 03:45:47

Tags: scala apache-spark hive

I am getting the following error when I try to unload data from a Hive partitioned table in a Scala program.

Error: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

Here are the steps that lead to the error:

1) Unload n columns of various data types from a DB2 table into a DataFrame

val dbDF = <select col1, col2, ... from my_DB2>
dbDF.printSchema
root
 |-- col1: string
 |-- col2: integer
  .
  .
 |-- col10: long
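
For reference, a minimal sketch of what such a DB2 read could look like through Spark's JDBC data source; the URL, table name, and credentials below are placeholders, not the values actually used:

// Hypothetical JDBC read from DB2; every option value here is a placeholder.
val dbDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://db2host:50000/MYDB")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("dbtable", "(select col1, col2 from my_DB2) as src")
  .option("user", "db2user")
  .option("password", "db2password")
  .load()

dbDF.printSchema()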

2) All of the columns are then cast to String

 val castdbDF = dbDF.cast all columns to String
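
A minimal sketch of one way to cast every column to String; this is an assumption about how castdbDF is built, not the actual code:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// Cast every column of dbDF to StringType, keeping the original column names.
val castdbDF = dbDF.select(dbDF.columns.map(c => col(c).cast(StringType).as(c)): _*)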

3) Then I run an aggregation on the above DataFrame

val aggrDF = castdbDF.aggr()  
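
The aggregation itself isn't shown, so the following is only a hedged sketch with made-up column names (col1 as the grouping key, col2 as the measure):

import org.apache.spark.sql.functions.sum

// Hypothetical aggregation: group by col1 and sum col2.
// Because castdbDF's columns are strings, in Spark 2.x sum("col2") is resolved as a double.
val aggrDF = castdbDF.groupBy("col1").agg(sum("col2").as("aggcolumns"))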

4) Then I join the resulting DataFrame back with dbDF to get all the columns.

val reconDF = aggrDF.as("df1").join(castdbDF.as("df2"), Seq(keys), "left").select("df2.*", "df1.aggcolumns")
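
Spelled out with a hypothetical join key (standing in for the real Seq(keys)), the same join might read:

import org.apache.spark.sql.functions.col

// "col1" is a placeholder for the actual join key column(s).
val reconDF = aggrDF.as("df1")
  .join(castdbDF.as("df2"), Seq("col1"), "left")
  .select(col("df2.*"), col("df1.aggcolumns"))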

5) This DataFrame is then loaded into the Hive partitioned table

reconDF.repartition(2).write.mode(SaveMode.Append).orc(tblpath)

Hive table (all columns are declared as string):

CREATE EXTERNAL TABLE IF NOT EXISTS tbl1 (
  col1 string,
  ....
)
PARTITIONED BY (processing_date string)
STORED AS ORC
LOCATION "hdfs location"

Then I run MSCK REPAIR TABLE tbl1.
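
A sketch of how the write and the repair step could fit together; tblpath, the partition value, and writing straight into a partition subdirectory are assumptions made for illustration:

import org.apache.spark.sql.SaveMode

// Hypothetical target: the partition directory for one processing_date.
val tblpath = "hdfs:///path/to/tbl1/processing_date=2018-08-13"
reconDF.repartition(2).write.mode(SaveMode.Append).orc(tblpath)

// Register the new partition directory with the Hive metastore.
spark.sql("MSCK REPAIR TABLE tbl1")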

In another Scala process, I unload the data from the Hive table into a DataFrame:

val hiveDF = spark.sql("select * from tbl1 where processing_date='date'")
hiveDF.show()
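
Since the exception is a mismatch between the value type in the ORC files and the string type the table declares, one way to see what Spark actually finds in the files is to read them directly and compare with the table definition; the path below is a placeholder:

// Hypothetical check: infer the schema straight from the ORC files, bypassing the metastore.
val fileDF = spark.read.orc("hdfs:///path/to/tbl1")
fileDF.printSchema()

// Compare with what the Hive table declares.
spark.sql("DESCRIBE tbl1").show(false)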

The show() above seems to be where the error comes from; I am not sure what is causing it.

Appreciate any help.

Below is the full error:

org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)

    at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:547)

    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)

    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)

    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)

    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)

    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)

    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

    at org.apache.spark.scheduler.Task.run(Task.scala:108)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)

    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)

    at scala.Option.foreach(Option.scala:257)

    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)

    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)

    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)

    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)

    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)

    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)

    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)

    at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)

    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)

    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)

    at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)

    at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)

    at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)

    at org.apache.spark.sql.Dataset.show(Dataset.scala:637)

    at org.apache.spark.sql.Dataset.show(Dataset.scala:596)

    at com.cmsmasubm.ma_clms_inst_trspy$.main(ma_clms_inst_trspy.scala:281)

    at com.cmsmasubm.ma_clms_inst_trspy.main(ma_clms_inst_trspy.scala)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:498)

    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)

    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)

    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)

    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)

    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)

    at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:547)

    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)

    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)

    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)

    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)

    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)

    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

    at org.apache.spark.scheduler.Task.run(Task.scala:108)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

0 Answers