Hive column read as NULL for Parquet file written using pyarrow

Date: 2018-08-31 13:40:36

Tags: hive parquet presto pyarrow

A pandas DataFrame has two columns, hero_sku (string) and neighbors_list (list of strings):

>>> visual_skus_to_knns_df.head()
   hero_sku                                     neighbors_list
0  IVBX6548  [IVBX6548, IVBX6511, IVBX6535, IVBX6391, IVBX6...
1  IVBX6549  [IVBX6549, IVBX6512, IVBX6536, IVBX6448, IVBX6...
2  BLMS1270  [BLMS1270, FRUP1958, SADL1011, BLMK3080, BLMK4...
3  EUNH6179  [EUNH6179, ETUB7716, URBH6598, BLGA1031, FAV19...
4  IVBX6540  [IVBX6540, IVBX6515, IVBX6502, IVBX6552, IVBX5...
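
For context, the file was presumably written along these lines; the exact write call is not shown in the question, so this reconstruction (including the default index handling) is an assumption:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed write step (not shown in the question).
# pa.Table.from_pandas() preserves the pandas index by default, which is
# where the __index_level_0__ column in the schema below comes from.
table = pa.Table.from_pandas(visual_skus_to_knns_df)
pq.write_table(table, '/tmp/rmore/pyarrow_example/demo_rupesh.parquet')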

Schema and head information for the Parquet file, obtained with parquet-tools:

$ parquet-tools schema /tmp/rmore/pyarrow_example/demo_rupesh.parquet
message schema {
  optional binary hero_sku (UTF8);
  optional group neighbors_list (LIST) {
    repeated group list {
      optional binary item (UTF8);
    }
  }
  optional int64 __index_level_0__;
}

$ parquet-tools head /tmp/rmore/pyarrow_example/demo_rupesh.parquet
hero_sku = IVBX6548
neighbors_list:
.list:
..item = IVBX6548
.list:
..item = IVBX6511
.list:
..item = IVBX6535
.list:
..item = IVBX6391
.list:
..item = IVBX6488
.list:
..item = IVBX6460
.list:
..item = IVBX6475
.list:
..item = IVBX6380
.list:
..item = IVBX6402
.list:
..item = IVBX6502
.list:
..item = IVBX6393
.list:
..item = IVBX5206
.list:
..item = IVBX6526
.list:
..item = IVBX6412
.list:
..item = IVBX6389
.list:
..item = IVBX6425
.list:
..item = IVBX6446
.list:
..item = IVBX6540
.list:
..item = IVBX6515
.list:
..item = IVBX6414
.list:
..item = IVBX5035
__index_level_0__ = 0
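
A minimal round-trip check with pyarrow, to confirm the file itself is readable (path as above):

import pyarrow.parquet as pq

# Round-trip check: read the file back with the library that wrote it.
table = pq.read_table('/tmp/rmore/pyarrow_example/demo_rupesh.parquet')
print(table.schema)
print(table.to_pandas().head())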

The Hive table created:

CREATE EXTERNAL TABLE `rmore.hd_visual_skus_to_knns_1`(
  `hero_sku` string, 
  `neighbor_list` array<string>)
STORED AS PARQUET
LOCATION
  '/tmp/rmore/pyarrow_example';
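
Hive's Parquet reader resolves columns by name by default, so it is worth double-checking that the field names stored in the file match the DDL exactly; a minimal check with pyarrow (path as above):

import pyarrow.parquet as pq

# Print the top-level field names stored in the file so they can be
# compared against the column names declared in the Hive DDL.
print(pq.read_schema('/tmp/rmore/pyarrow_example/demo_rupesh.parquet').names)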

Selecting from the Hive table shows the neighbor_list column as NULL:

hive> 
    > select * from rmore.hd_visual_skus_to_knns limit 2;
OK
IVBX6548    NULL
IVBX6549    NULL
Time taken: 0.149 seconds, Fetched: 2 row(s)

In PrestoDB, I can see the data in the neighbor_list column; Hive and PrestoDB share the same metastore. We are using Presto CLI version 0.198.

[Screenshot: PrestoDB query showing the neighbor_list data]

Question: why can Presto select the column data while Hive cannot? I also tried exploding the neighbors_list column into rows, but again could not select the exploded column in Hive. The Hive version we are using is 1.1.0 on Cloudera 5.14.2. Also observed: after hitting enter in the Hive session, a lot of warnings are printed:

hive> 
    > 
    > Aug 31, 2018 9:25:27 AM WARNING: parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.3.2-SNAPSHOT
parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-cpp version 1.3.2-SNAPSHOT using format: (.+) version ((.*) )?\(build ?(.*)\)
    at parquet.VersionParser.parse(VersionParser.java:112)
    at parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:66)
    at parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:294)
    at parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:601)
    at parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:578)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
    at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:674)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:324)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:446)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:415)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:140)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2069)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:246)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:175)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:389)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:634)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Aug 31, 2018 9:25:27 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 31, 2018 9:25:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 258319 records.
Aug 31, 2018 9:25:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 31, 2018 9:25:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 216 ms. row count = 258319
Aug 31, 2018 9:25:46 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 31, 2018 9:25:46 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 258319 records.
Aug 31, 2018 9:25:46 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 31, 2018 9:25:46 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 158 ms. row count = 258319
Aug 31, 2018 9:26:31 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 31, 2018 9:26:31 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 258319 records.
Aug 31, 2018 9:26:31 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 31, 2018 9:26:31 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 87 ms. row count = 258319
Aug 31, 2018 9:26:39 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 31, 2018 9:26:39 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 258319 records.
Aug 31, 2018 9:26:39 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 31, 2018 9:26:40 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 87 ms. row count = 258319

0 Answers