Problem when reading a Hive table column that has a bloom filter on it.
The table is in ORC format:
CREATE EXTERNAL TABLE test (
id bigint,
name string)
PARTITIONED BY (
month_end_year int,
month_end_month int,
month_end_date date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://<some location>/test'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id')
The PySpark code is as follows:
spark = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
df = spark.sql('select * from default.test')
df.show()
It fails with the following exception:
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$BufferChunk.sliceAndShift(RecordReaderImpl.java:871)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.getStreamBuffers(RecordReaderUtils.java:332)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.createStreams(RecordReaderImpl.java:947)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:969)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:793)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1019)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:205)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:541)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1216)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:275)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$BufferChunk.sliceAndShift(RecordReaderImpl.java:865)
    ... 44 more
Hive version: 2.3.3-amzn-2
Spark version: 2.3.2