Multiple joins with the same table in Hive

Time: 2014-12-16 22:58:27

Tags: hive hiveql

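In the query below, I join table T1 to several tables on the same key. I would like to know whether, in this case, I need to specify the conditions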

 AND a.ds = '2014-12-10'
 AND a.org_id IS NULL

in every join. What is the reasoning for not doing so?

INSERT OVERWRITE TABLE tab1
        PARTITION(ds='2014-12-10')
    SELECT
        a.var1
        , b.var2
        , c.var3
        , d.var4

    FROM T1 a
    LEFT OUTER JOIN T2 b
        ON a.var1 = b.var1

        AND a.ds = '2014-12-10'
        AND a.org_id IS NULL

    LEFT OUTER JOIN T3 c
        ON a.var1 = c.var1
        AND c.ds = '2014-12-10'

        AND a.ds = '2014-12-10'
        AND a.org_id IS NULL

    LEFT OUTER JOIN T4 d
        ON a.var1 = d.var1
        AND d.ds = '2014-12-10'

        AND a.ds = '2014-12-10'
        AND a.org_id IS NULL

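2 answers:

Answer 0: (score: 0)

No, you don't. You can simply move these conditions to the WHERE clause of the whole query. Check the explain plan; it will be the same as what you currently have.

Code example: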

INSERT OVERWRITE TABLE tab1
        PARTITION(ds='2014-12-10')
    SELECT
        a.var1
        , b.var2
        , c.var3
        , d.var4
    FROM T1 a
    LEFT OUTER JOIN T2 b
        ON a.var1 = b.var1
    LEFT OUTER JOIN T3 c
        ON a.var1 = c.var1
        AND c.ds = '2014-12-10'
    LEFT OUTER JOIN T4 d
        ON a.var1 = d.var1
        AND d.ds = '2014-12-10'
    WHERE
        a.ds = '2014-12-10'
        AND a.org_id IS NULL
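
The plan comparison suggested in this answer can be run with Hive's `EXPLAIN` statement. The sketch below is a trimmed-down version of the query (only the T2 join and a shortened select list, which is a simplification assumed here for brevity), once with the T1 conditions in the ON clause and once in WHERE. Under standard outer-join semantics, a WHERE condition on T1 removes the T1 rows that fail it, while the same condition inside ON only controls which T2 rows are matched, so it is worth comparing the row counts of the two variants as well as their plans.

```sql
-- Variant A: conditions on T1 placed in the ON clause
EXPLAIN
SELECT a.var1, b.var2
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL;

-- Variant B: the same conditions moved to the WHERE clause
EXPLAIN
SELECT a.var1, b.var2
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
WHERE a.ds = '2014-12-10'
    AND a.org_id IS NULL;
```

If the two outputs show the filter on T1 at the same stage, the plans match for this fragment; if they differ, the full query is likely to differ as well.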

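Answer 1: (score: -1)

Here is what I found: there is a concept called PredicatePushDown (I am not 100% sure, but it is the default in newer versions of Hive). If I enable it with set hive.optimize.ppd=true; then I get the same performance in both cases: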

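Case I: conditions specified in all joins

Result:

- 24254 rows loaded to tab1

- MapReduce Jobs Launched:

- Job 0: Map: 16  Reduce: 4  Cumulative CPU: 802.6 sec  HDFS Read: 3020743758  HDFS Write: 900057  SUCCESS

- Job 1: Map: 1  Cumulative CPU: 4.93 sec  HDFS Read: 965541  HDFS Write: 898430  SUCCESS

- Total MapReduce CPU Time Spent: 13 minutes 27 seconds 530 msec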

INSERT OVERWRITE TABLE tab1
    blah blah ..
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1

    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'

    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T4 d
    ON a.var1 = d.var1
    AND d.ds = '2014-12-10'

    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL

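Case II: conditions specified only in the first join

Result:

- 24254

- Job 0: Map: 16  Reduce: 4  Cumulative CPU: 803.35 sec  HDFS Read: 3020743758  HDFS Write: 900057  SUCCESS

- Job 1: Map: 1  Cumulative CPU: 3.75 sec  HDFS Read: 965541  HDFS Write: 898429  SUCCESS

- Total MapReduce CPU Time Spent: 13 minutes 27 seconds 100 msec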

INSERT OVERWRITE TABLE tab1
    blah blah ..
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1

    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'

LEFT OUTER JOIN T4 d
    ON a.var1 = d.var1
    AND d.ds = '2014-12-10'
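
For anyone repeating this comparison, the setting mentioned in this answer can be inspected and changed per session from the Hive CLI. This is only a minimal sketch of the session setup; it assumes nothing beyond the `hive.optimize.ppd` property named above, and the two queries themselves are run afterwards exactly as shown.

```sql
-- Print the current value of the predicate pushdown setting
SET hive.optimize.ppd;

-- Enable predicate pushdown for the current session
SET hive.optimize.ppd=true;
```

Running both cases in the same session after these statements keeps the setting identical between them, so any difference in the job counters comes from the queries rather than the configuration.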