SparkSQL cannot read a specific column from an ORC table in Hive

Asked: 2017-06-30 09:50:20

Tags: hive apache-spark-sql orc

I am using SparkSQL 2.1.1 to read from an ORC table in Hive 1.2.1, stored on Google Cloud Storage. I can successfully select most of the columns, except for one of type smallint (referred to here as col1). If I try to select that particular column with this code

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hc.sql("SELECT col1 FROM table")
result.collect().foreach(println)

it fails with this exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 4 times, most recent failure: Lost task 0.3 in stage 24.0 (TID 378, executor 42): java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.ShortWritable
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableShortObjectInspector.get(WritableShortObjectInspector.java:36)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$4.apply(TableReader.scala:390)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$4.apply(TableReader.scala:390)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
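Reading the stack trace, the Hive metastore apparently declares col1 as smallint (hence the WritableShortObjectInspector), while the ORC files hand back an IntWritable, which suggests the on-disk schema stores the column as int. One way to check this (a sketch only; the gs:// path is a placeholder, and spark is assumed to be the SparkSession from spark-shell or built as shown) is to compare the metastore schema against the schema Spark's native ORC reader infers from the files themselves:

import org.apache.spark.sql.SparkSession

// Sketch: build (or reuse) a Hive-enabled session; in spark-shell this already exists as `spark`
val spark = SparkSession.builder()
  .appName("orc-schema-check")
  .enableHiveSupport()
  .getOrCreate()

// Schema as the metastore sees it
spark.sql("DESCRIBE table").show(false)

// Schema as Spark's native ORC reader infers it from the files
// (placeholder path -- replace with the table's actual location on GCS)
spark.read.orc("gs://my-bucket/warehouse/table").printSchema()

If the two disagree on col1 (smallint in the metastore, int in the files), that mismatch would explain the IntWritable to ShortWritable ClassCastException.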

I have also tried casting the column to short, but without success:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hc.sql("SELECT cast(col1 as short) FROM table")
result.collect().foreach(println)
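If the files really do store col1 as int, a possible workaround, sketched here only and not a verified fix, is to read the ORC files directly with Spark's native reader and cast afterwards; this avoids the Hive deserialization path in HadoopTableReader, which the SQL cast above still has to go through before the cast is ever applied. The path is again a placeholder:

import org.apache.spark.sql.types.ShortType

// Sketch: bypass the Hive SerDe by reading the ORC files directly,
// then cast the already-decoded int column to short
val direct = spark.read.orc("gs://my-bucket/warehouse/table")
direct.select(direct("col1").cast(ShortType))
  .collect()
  .foreach(println)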

0 Answers:

No answers