Hive: WHERE + IN not using partitions?

Asked: 2016-04-27 16:59:23

Tags: hive partitioning

I am querying a large table that is partitioned on a field called day. If I run the query:

select * from my_table where day in ('2016-04-01', '2016-03-01')

I get a large number of mappers and reducers, and the query takes a long time to run.

However, if I write the query as:

select * from my_table where day = '2016-04-01' or day = '2016-03-01'

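I get far fewer mappers and reducers, and the query runs quickly. To me this suggests that in does not take advantage of the table's partitions. Can anyone confirm this and explain why?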

Hive version: 1.2.1
Hadoop version: 2.3.4.7-4

Details:
I believe the relevant parts of the execution plans are...
Using where ... or
No Filter Operator at all

Using where ... in
Filter Operator
  predicate: (day) IN ('2016-04-01', '2016-03-01') (type: boolean)
  Statistics: Num rows: 100000000 Data size: 9999999999

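The Hive docs just say:
'The partition to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns.'

but they do not elaborate. I could not find any SO posts directly related to this.

Thanks!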

1 Answer:

Answer 0 (score: 0)

tl;dr

I am using Hive 1.1.0 on Cloudera 5.13.3, and according to the explain plans I ran in Hue, IN gets the same optimization as the equals operator (=).


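Example

My table is partitioned on LOAD_YEAR (SMALLINT) and LOAD_MONTH (TINYINT) and has the following two partitions:

  1. load_year=2018/load_month=10 (19,828,71 rows)
  2. load_year=2018/load_month=11 (702,856 rows)

Below are various queries and their explain plans.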
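
For anyone wanting to reproduce the setup, a minimal sketch follows. It assumes a table named TBL partitioned as described above; SHOW PARTITIONS lists the partitions registered in the metastore, and prefixing a query with EXPLAIN prints its plan instead of running it.

-- List the partitions Hive knows about for the table
-- (for the table above this should include load_year=2018/load_month=10 and load_year=2018/load_month=11).
SHOW PARTITIONS TBL;

-- Print the plan for a query without executing it.
EXPLAIN
SELECT ID
FROM TBL
WHERE LOAD_MONTH = 11Y;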

1. Equals (=) operator

Query:

SELECT ID
FROM TBL
WHERE LOAD_MONTH = 11Y

Explain plan:

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (load_month = 11) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink

2. IN operator

Query (note that there is no month 12 in the data):

SELECT ID
FROM TBL
WHERE LOAD_MONTH IN (11Y, 12Y)

Explain plan:

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (load_month = 11) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink

3. Equals (=) combined with AND and OR

Query:

SELECT ID
FROM TBL
WHERE
    (LOAD_YEAR = 2018S AND LOAD_MONTH = 11Y)
OR  (LOAD_YEAR = 2019S AND LOAD_MONTH = 1Y)

Explain plan:

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (((load_year = 2018) and (load_month = 11)) or ((load_year = 2019) and (load_month = 1))) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink

4. Arithmetic operation

Query:

SELECT ID
FROM TBL
WHERE (LOAD_YEAR * 100 + LOAD_MONTH) IN (201811, 201901)

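Side note:

100 has no suffix, so it is an INT, and (LOAD_YEAR * 100 + LOAD_MONTH) is therefore an INT as well. This ensures the result is accurate. Because LOAD_YEAR is a SMALLINT and LOAD_MONTH is a TINYINT, arithmetic between those two alone would use SMALLINT for the result, whose maximum value is 32,767. That is not enough for yyyymm, which needs 6 digits, i.e. a range up to at least 999,999. With 100 as an INT, the calculation is done in the INT type, which allows values up to 2,147,483,647.

Explain plan: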

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (201811) IN (201811, 201901) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
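
The typing point from the side note can also be made explicit with casts, so the query does not depend on the literal 100 being an INT. This is only a sketch on the same TBL table; the original answer did not test this variant, so whether it is pruned in the same way is an assumption to confirm with EXPLAIN.

-- Sketch only: explicit casts make the INT arithmetic visible.
-- Not part of the original answer; check its plan before relying on it.
SELECT ID
FROM TBL
WHERE (CAST(LOAD_YEAR AS INT) * 100 + CAST(LOAD_MONTH AS INT)) IN (201811, 201901)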

Summary

All of these queries scan only the second partition, avoiding the roughly 20 million rows in the other partition.
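
A more direct way to see which partitions a query reads is EXPLAIN DEPENDENCY, which reports the input tables and input partitions as JSON. A sketch using the IN query from this answer; the exact output layout may differ between Hive versions.

-- Reports input_tables and input_partitions for the query.
-- If pruning works, only the matching partition(s) should be listed.
EXPLAIN DEPENDENCY
SELECT ID
FROM TBL
WHERE LOAD_MONTH IN (11Y, 12Y);

Running the same statement against the IN and OR forms of a query makes it easy to compare the partitions each one touches.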