Issue reading a Hive column with a bloom filter using pyspark

Asked: 2019-05-03 15:02:06

Tags: apache-spark hive pyspark amazon-emr

I am running into a problem when reading a Hive table column that has a bloom filter.

The table is in ORC format.

CREATE EXTERNAL TABLE test (
  id bigint,
  name string)
PARTITIONED BY ( 
  month_end_year int, 
  month_end_month int, 
  month_end_date date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://<some location>/test'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id')
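For context on what `'orc.bloom.filter.columns'='id'` asks the writer to do: ORC builds a bloom filter over the `id` values of each stripe/row group and stores it in the file metadata, so a reader evaluating an equality predicate can skip stripes where the filter proves the value is absent. The following is a minimal, hypothetical pure-Python sketch of that idea (real ORC uses Murmur3 hashing and its own serialized format, not this code):

```python
# Simplified illustration of per-stripe bloom filters, as created by
# 'orc.bloom.filter.columns'='id'. Hypothetical sketch, not ORC's actual code.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set packed into one int

    def _positions(self, value):
        # Derive k bit positions from a stable hash of the value.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False -> value is definitely absent; True -> possibly present.
        return all(self.bits & (1 << pos) for pos in self._positions(value))

# One filter per "stripe" of ids, mirroring ORC's per-stripe filters.
stripe_ids = [range(0, 1000), range(1000, 2000)]
filters = []
for ids in stripe_ids:
    bf = BloomFilter()
    for i in ids:
        bf.add(i)
    filters.append(bf)

# A predicate like "id = 5000" lets the reader skip any stripe whose
# filter reports the value as definitely absent.
skipped = [not f.might_contain(5000) for f in filters]
```

A bloom filter can return false positives but never false negatives, which is why it is safe for stripe skipping: a "maybe present" stripe is still read and filtered row by row.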

The pyspark code is as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
df = spark.sql('select * from default.test')
df.show()
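One thing that may be worth trying (an assumption on my part, not a confirmed fix): Spark 2.3 also ships a native ORC reader that bypasses the `org.apache.hadoop.hive.ql.io.orc` classes appearing in the stack trace. It can be enabled in `spark-defaults.conf` or via `--conf`:

```
spark.sql.orc.impl                  native
spark.sql.hive.convertMetastoreOrc  true
```

With these set, `spark.sql('select * from default.test')` would be planned with Spark's own ORC data source instead of the Hive SerDe read path.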

I get the following exception:

Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$BufferChunk.sliceAndShift(RecordReaderImpl.java:871)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.getStreamBuffers(RecordReaderUtils.java:332)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.createStreams(RecordReaderImpl.java:947)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:969)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:793)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1019)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:205)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:541)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1216)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:275)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$BufferChunk.sliceAndShift(RecordReaderImpl.java:865)
    ... 44 more

Hive version: 2.3.3-amzn-2

Spark version: 2.3.2

0 Answers