I have a Pandas DataFrame with two columns: hero_sku (a string) and neighbors_list (a list of strings):
>>> visual_skus_to_knns_df.head()
hero_sku neighbors_list
0 IVBX6548 [IVBX6548, IVBX6511, IVBX6535, IVBX6391, IVBX6...
1 IVBX6549 [IVBX6549, IVBX6512, IVBX6536, IVBX6448, IVBX6...
2 BLMS1270 [BLMS1270, FRUP1958, SADL1011, BLMK3080, BLMK4...
3 EUNH6179 [EUNH6179, ETUB7716, URBH6598, BLGA1031, FAV19...
4 IVBX6540 [IVBX6540, IVBX6515, IVBX6502, IVBX6552, IVBX5...
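For context, here is roughly how the Parquet file was produced (a minimal sketch; the pyarrow calls are my reconstruction, the DataFrame name and file name are from the session above):

import pyarrow as pa
import pyarrow.parquet as pq

# Convert the DataFrame to an Arrow table; the pandas index is kept
# and shows up as the __index_level_0__ column in the schema below.
table = pa.Table.from_pandas(visual_skus_to_knns_df)

# Written locally, then placed under /tmp/rmore/pyarrow_example on HDFS.
pq.write_table(table, 'demo_rupesh.parquet')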
Schema and head information for the Parquet file, obtained with parquet-tools:
$ parquet-tools schema /tmp/rmore/pyarrow_example/demo_rupesh.parquet
message schema {
  optional binary hero_sku (UTF8);
  optional group neighbors_list (LIST) {
    repeated group list {
      optional binary item (UTF8);
    }
  }
  optional int64 __index_level_0__;
}
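As a sanity check, the same footer metadata and data can be read back directly with pyarrow (a quick sketch, assuming a local copy of the file):

import pyarrow.parquet as pq

pf = pq.ParquetFile('demo_rupesh.parquet')
print(pf.schema)               # same schema that parquet-tools prints above
print(pf.metadata.created_by)  # 'parquet-cpp version 1.3.2-SNAPSHOT', the
                               # string the Hive warnings below complain about
df = pf.read().to_pandas()     # neighbors_list comes back as lists of strings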
$ parquet-tools head /tmp/rmore/pyarrow_example/demo_rupesh.parquet
hero_sku = IVBX6548
neighbors_list:
.list:
..item = IVBX6548
.list:
..item = IVBX6511
.list:
..item = IVBX6535
.list:
..item = IVBX6391
.list:
..item = IVBX6488
.list:
..item = IVBX6460
.list:
..item = IVBX6475
.list:
..item = IVBX6380
.list:
..item = IVBX6402
.list:
..item = IVBX6502
.list:
..item = IVBX6393
.list:
..item = IVBX5206
.list:
..item = IVBX6526
.list:
..item = IVBX6412
.list:
..item = IVBX6389
.list:
..item = IVBX6425
.list:
..item = IVBX6446
.list:
..item = IVBX6540
.list:
..item = IVBX6515
.list:
..item = IVBX6414
.list:
..item = IVBX5035
__index_level_0__ = 0
The Hive table I created:
CREATE EXTERNAL TABLE `rmore.hd_visual_skus_to_knns_1`(
  `hero_sku` string,
  `neighbor_list` array<string>)
STORED AS PARQUET
LOCATION
  '/tmp/rmore/pyarrow_example';
Selecting from the Hive table shows the neighbor_list column as NULL:
hive>
> select * from rmore.hd_visual_skus_to_knns limit 2;
OK
IVBX6548 NULL
IVBX6549 NULL
Time taken: 0.149 seconds, Fetched: 2 row(s)
In PrestoDB I can see the data in the neighbor_list column; Hive and PrestoDB share the same metastore. We are using Presto CLI version 0.198.
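For reference, a query of this shape through the Presto CLI does return the array populated (a sketch; the "hive" catalog name and the table name are my assumptions based on the DDL above):

-- Presto CLI 0.198; the "hive" catalog name is an assumption
SELECT hero_sku, neighbor_list
FROM hive.rmore.hd_visual_skus_to_knns_1
LIMIT 2;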
Question: why can't Hive select the column data that Presto can? I also tried exploding the neighbors_list column into rows (a sketch of that query is at the end, after the log), but again could not select the exploded column in Hive. The Hive version we are using is 1.1.0 on Cloudera 5.14.2. I have also observed a lot of warnings after hitting Enter in the Hive session:
hive>
>
> Aug 31, 2018 9:25:27 AM WARNING: parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.3.2-SNAPSHOT
parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-cpp version 1.3.2-SNAPSHOT using format: (.+) version ((.*) )?\(build ?(.*)\)
    at parquet.VersionParser.parse(VersionParser.java:112)
    at parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:66)
    at parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:294)
    at parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:601)
    at parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:578)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
    at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:674)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:324)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:446)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:415)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:140)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2069)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:246)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:175)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:389)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:634)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Aug 31, 2018 9:25:27 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 31, 2018 9:25:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 258319 records.
Aug 31, 2018 9:25:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 31, 2018 9:25:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 216 ms. row count = 258319
Aug 31, 2018 9:25:46 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 31, 2018 9:25:46 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 258319 records.
Aug 31, 2018 9:25:46 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 31, 2018 9:25:46 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 158 ms. row count = 258319
Aug 31, 2018 9:26:31 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 31, 2018 9:26:31 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 258319 records.
Aug 31, 2018 9:26:31 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 31, 2018 9:26:31 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 87 ms. row count = 258319
Aug 31, 2018 9:26:39 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 31, 2018 9:26:39 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 258319 records.
Aug 31, 2018 9:26:39 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 31, 2018 9:26:40 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 87 ms. row count = 258319
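For completeness, the explode attempt mentioned above looked roughly like this (a sketch; the alias names are mine):

-- Hive 1.1.0 on CDH 5.14.2; the exploded values also come back empty
SELECT hero_sku, neighbor
FROM rmore.hd_visual_skus_to_knns_1
LATERAL VIEW explode(neighbor_list) nl AS neighbor
LIMIT 10;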