Unable to query Parquet data with nested fields in Presto

Date: 2018-08-19 14:50:52

Tags: hive parquet presto

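I have data in which every record contains nested columns (arrays of arrays of objects), written as Parquet from Spark 2.2.

I am now trying to read this data externally with Presto, and I get the following exception whenever I access any of the nested columns.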

com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://name-node/parquet_path/part-00023-8d4f14b1-a3f1-4055-b931-04838701048d-c000.snappy.parquet (offset=0, length=108289): parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:220)
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:115)
    at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:157)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:93)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:239)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
    at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
    at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:56)
    at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:90)
    at com.facebook.presto.hive.parquet.ParquetPageSource.<init>(ParquetPageSource.java:109)

Interestingly, I can query the other, non-nested columns without any problem.

The CREATE TABLE statement looks like this:

CREATE TABLE hive.tests.table_name (
not_nested_field_1 BIGINT,
not_nested_field_2 BIGINT,
not_nested_field_3 BOOLEAN,
not_nested_field_4 DOUBLE,
not_nested_field_5 ARRAY(VARCHAR),
nested_field_1 ARRAY(ROW(
    nested_level0_field1 BOOLEAN,
    nested_level0_field2 BIGINT,
    nested_level0_field3 BIGINT,
    nested_level0_field4 ARRAY(ROW(
        nested_level1_field1 BOOLEAN,
        nested_level1_field2 BIGINT,
        nested_level1_field3 VARCHAR,
        nested_level1_field4 ARRAY(ROW(
            nested_level2_field1 VARCHAR,
            nested_level2_field2 VARCHAR,
            nested_level2_field3 ARRAY(ROW(
                nested_level3_field1 VARCHAR,
                nested_level3_field2 VARCHAR)))),
        nested_level1_field5 ARRAY(ROW(
            nested_level2_field4 BIGINT,
            nested_level2_field5 BIGINT,
            nested_level2_field6 ARRAY(ROW(
                nested_level3_field3 VARCHAR,
                nested_level3_field4 VARCHAR)))))))))
WITH (
  format = 'PARQUET',
  external_location = 'hdfs://name-node/parquet_path/'
);

I am using Presto version 0.208, and the external table was created through a local Hive Metastore.

Any help would be greatly appreciated :)

1 Answer:

Answer 0: (score: 3)

The problem was solved by setting the property hive.parquet.use-column-names=true in catalog/hive.properties.
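For reference, the change amounts to one extra line in the Hive catalog file. This is only a fragment: the other entries already in your file (connector name, metastore URI and so on) stay as they are, and the comment line is added here for illustration. Catalog properties are normally read when the server starts, so expect to restart the Presto nodes before the setting takes effect.

```properties
# catalog/hive.properties (fragment; keep your existing entries)
hive.parquet.use-column-names=true
```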

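By default, Presto accesses Parquet data by column index, so this property has to be set explicitly for Presto to match the columns in the Parquet files by the names defined in CREATE TABLE.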
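To see why the lookup mode matters, here is a small Python sketch of the two strategies. It is a simplified, hypothetical model and not Presto's actual code: the schemas, column names and helper functions are made up for illustration, and it assumes that the column order in the Parquet files differs from the order in the table definition (the question does not confirm this, but it would be consistent with a primitive column being found where a group was expected).

```python
# Hypothetical, simplified model of column resolution; not Presto's implementation.
# Each column is a (name, kind) pair, where kind is "primitive" or "group".

# Assumed order of columns as written in the Parquet file.
file_schema = [("id", "primitive"), ("score", "primitive"), ("items", "group")]

# Assumed order of columns as declared in CREATE TABLE.
table_schema = [("id", "primitive"), ("items", "group"), ("score", "primitive")]


def resolve_by_index(table, file):
    """Pair the i-th table column with the i-th file column, ignoring names."""
    resolved = {}
    for i, (name, _kind) in enumerate(table):
        resolved[name] = file[i] if i < len(file) else None
    return resolved


def resolve_by_name(table, file):
    """Pair each table column with the file column of the same name."""
    lookup = {fname.lower(): (fname, fkind) for fname, fkind in file}
    return {name: lookup.get(name.lower()) for name, _kind in table}


def mismatches(table, resolved):
    """Return table columns whose resolved file column has a different kind."""
    return [
        name
        for name, kind in table
        if resolved[name] is not None and resolved[name][1] != kind
    ]


if __name__ == "__main__":
    print("by index:", mismatches(table_schema, resolve_by_index(table_schema, file_schema)))
    print("by name: ", mismatches(table_schema, resolve_by_name(table_schema, file_schema)))
```

In this model the index-based lookup hands the nested column `items` a primitive file column, which is the same kind of mismatch the ClassCastException reports, while the name-based lookup pairs every column correctly.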