Question

I have a dataset on a CDH cluster that is partitioned by yyyymm.

When I run the following query in Hive:

select actvydt, cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int) from pos where yyyymm=201601 and actvydt>='2016-01-01' and actvydt<='2016-01-09' limit 10;

It reads only the correct partition, 201601, of the dataset.

Here is the result:

actvydt     yyyymm
2016-01-02  201601
2016-01-02  201601
2016-01-02  201601

But when I run the following query (the only change is that the value for yyyymm is computed with the substr and concat functions):

select actvydt,cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int) from pos.pos_sales_weekly where yyyymm=cast(trim((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2))))) as int) and actvydt>='2016-01-01' and actvydt<='2016-01-09' limit 10;

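It scans the entire dataset, so the yyyymm value is apparently not being passed correctly. There seems to be a problem with this expression: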

 cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int)

However, when the same expression is selected as a column, as the result above shows, it yields the correct value, 201601. Any help would be greatly appreciated.

Here is the table schema:

CREATE EXTERNAL TABLE IF NOT EXISTS pos (
  nid bigint,
  actvydt date,
  upc string,
  tchid string,
  posfileid string,
  yssk bigint
)
PARTITIONED BY (yyyymm int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/'
TBLPROPERTIES ('avro.output.codec'='snappy');

Answer 1

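The partition key value must be known before the query executes for partition pruning to take effect. You are using this WHERE clause:

yyyymm=cast(trim((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2))))) as int) and actvydt>='2016-01-01' and actvydt<='2016-01-09'

Unfortunately, the optimizer is not smart enough to infer the yyyymm value from a fairly complex function before the query executes. The expression depends on the ActvyDt column, so its value is only available row by row while the data is being read. Try adding an explicit condition as well: yyyymm=201601. That will work. You can pass it in as a variable.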
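
A minimal sketch of that suggestion, using the pos table from the question. The first statement adds the literal partition predicate. The second uses EXPLAIN DEPENDENCY, which reports the tables and partitions a query would read, as one way to check whether pruning happened. The exact output format depends on the Hive version, so none is shown here.

```sql
-- Literal partition predicate: the value is known when the query is compiled,
-- so Hive can restrict the scan to the yyyymm=201601 partition.
SELECT actvydt
FROM pos
WHERE yyyymm = 201601
  AND actvydt >= '2016-01-01'
  AND actvydt <= '2016-01-09'
LIMIT 10;

-- List the inputs (tables and partitions) the same query would read.
EXPLAIN DEPENDENCY
SELECT actvydt
FROM pos
WHERE yyyymm = 201601
  AND actvydt >= '2016-01-01'
  AND actvydt <= '2016-01-09';
```

Running the same EXPLAIN DEPENDENCY against the query with the computed predicate should make the difference in partitions visible.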

Answer 2

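Somewhere, somehow, the value 2016-01-01 gets created.

At that very moment, or very close to it, you should also be able to create 201601.

Once you have done that, you can pass it to the query the same way you pass 2016-01-01, and your problem will be solved.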
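
One way to pass both values is Hive variable substitution. The sketch below assumes that whatever produces the date range also derives 201601 at the same time. The variable names start_dt, end_dt and yyyymm are illustrative and do not come from the original post.

```sql
-- Hypothetical variable names; set them wherever the date range is produced.
SET hivevar:start_dt=2016-01-01;
SET hivevar:end_dt=2016-01-09;
SET hivevar:yyyymm=201601;

-- Variables are substituted before the query is compiled, so the partition
-- predicate reaches the optimizer as a plain literal.
SELECT actvydt
FROM pos
WHERE yyyymm = ${hivevar:yyyymm}
  AND actvydt >= '${hivevar:start_dt}'
  AND actvydt <= '${hivevar:end_dt}'
LIMIT 10;
```

The same variables can also be supplied on the command line, for example `hive --hivevar yyyymm=201601 --hivevar start_dt=2016-01-01 --hivevar end_dt=2016-01-09 -f query.hql`, where query.hql is a hypothetical file containing only the SELECT. A date range that spans several months would need a predicate covering each month, such as yyyymm IN (201601, 201602).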

Partitioning in Hive