How are CTEs (common table expressions) evaluated in Hive?

Time: 2017-02-27 12:29:15

Tags: hive common-table-expression

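My question is about performance and about how CTEs are evaluated at runtime.

I plan to reuse code by defining a base projection and then defining multiple CTEs on top of it, each with a different filter.

Will this cause any performance problems? More specifically, will the base projection be evaluated every time it is referenced?

For example: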

WITH CTE_PERSON as (
   SELECT * FROM PersonTable
),


CTE_PERSON_WITH_AGE as (
   SELECT * FROM CTE_PERSON WHERE age > 24 
),

CTE_PERSON_WITH_AGE_AND_GENDER as (
  SELECT * FROM CTE_PERSON_WITH_AGE WHERE gender = 'm'
),

CTE_PERSON_WITH_NAME as (
  SELECT * FROM CTE_PERSON WHERE name = 'abc'
)
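  • Will all entries from PersonTable be loaded into memory every time, with the filters applied afterwards, (or)
  • will only the result set that remains after filtering be loaded into memory?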

1 answer:

Answer 0 (score: 7)

The table is scanned only once.

Note in the plan below:
  - a single stage
  - a single TableScan
  - predicate: (((i = 1) and (j = 2)) and (k = 3)) (type: boolean)
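
Put differently, the merged predicate suggests that the chained CTEs are treated much like one flat query with all three filters in a single WHERE clause. A minimal sketch for comparison, assuming the same table t (i int, j int, k int) that is created in the example below; the flat form is an illustration, not part of the original answer:

```sql
-- Sketch only: assumes the table t (i int, j int, k int) from the example below.
-- A flat query with all three filters combined into one WHERE clause.
explain
select i,j,k
from   t
where  i=1
  and  j=2
  and  k=3
;
```

Running both statements and comparing the EXPLAIN output side by side is a simple way to confirm on your own Hive version that the CTE form adds no extra stage or scan.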

create table t (i int,j int,k int);
explain 
with    t1 as (select i,j,k from t  where i=1)
       ,t2 as (select i,j,k from t1 where j=2)
       ,t3 as (select i,j,k from t2 where k=3) 

select * from t3
;
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: t
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Filter Operator
            predicate: (((i = 1) and (j = 2)) and (k = 3)) (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: 1 (type: int), 2 (type: int), 3 (type: int)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              ListSink
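
The example above is a linear chain, whereas the query in the question references CTE_PERSON from two different CTEs. The same EXPLAIN check can be applied to that shape. The sketch below reuses the table and column names from the question; the final SELECT with UNION ALL is an assumption, added only so that the statement is complete and both branches are actually used:

```sql
-- Sketch only: PersonTable and its columns (age, gender, name) come from the question.
-- The final SELECT ... UNION ALL ... is an assumption added to make the statement complete.
EXPLAIN
WITH CTE_PERSON AS (
   SELECT * FROM PersonTable
),
CTE_PERSON_WITH_AGE AS (
   SELECT * FROM CTE_PERSON WHERE age > 24
),
CTE_PERSON_WITH_AGE_AND_GENDER AS (
   SELECT * FROM CTE_PERSON_WITH_AGE WHERE gender = 'm'
),
CTE_PERSON_WITH_NAME AS (
   SELECT * FROM CTE_PERSON WHERE name = 'abc'
)
-- Both branches go back to CTE_PERSON, so check how many TableScan operators appear.
SELECT * FROM CTE_PERSON_WITH_AGE_AND_GENDER
UNION ALL
SELECT * FROM CTE_PERSON_WITH_NAME;
```

In the output, count the TableScan operators and look at the predicate of each Filter Operator. Because each reference to a CTE is typically expanded inline, a CTE used in two branches may show up as two scans of PersonTable, each with its own pushed-down filter; the exact plan depends on the Hive version and optimizer settings. Newer Hive releases also document a hive.optimize.cte.materialize.threshold setting for materializing a CTE that is referenced several times; whether it is available and whether it helps is something to verify on your own installation.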