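In the query below, I join table T1 to several other tables on the same key. In this case, do I need to specify the conditions
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
in every join? If not, what is the reason I can leave them out?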
INSERT OVERWRITE TABLE tab1
PARTITION(ds='2014-12-10')
SELECT
a.var1
, b.var2
, c.var3
, d.var4
FROM T1 a
LEFT OUTER JOIN T2 b
ON a.var1 = b.var1
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
ON a.var1 = c.var1
AND c.ds = '2014-12-10'
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
LEFT OUTER JOIN T4 d
ON a.var1 = d.var1
AND d.ds = '2014-12-10'
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
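Answer 0 (score: 0)
No, you don't. You can simply move those conditions into the WHERE clause of the whole query. Check the explain plan; it should be the same as the one you have now.
Code example: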
INSERT OVERWRITE TABLE tab1
PARTITION(ds='2014-12-10')
SELECT
a.var1
, b.var2
, c.var3
, d.var4
FROM T1 a
LEFT OUTER JOIN T2 b
ON a.var1 = b.var1
LEFT OUTER JOIN T3 c
ON a.var1 = c.var1
AND c.ds = '2014-12-10'
LEFT OUTER JOIN T4 d
ON a.var1 = d.var1
AND d.ds = '2014-12-10'
WHERE
a.ds = '2014-12-10'
AND a.org_id IS NULL
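Before relying on the rewrite, it may be worth comparing the actual rows and not only the explain plan. In standard SQL, a predicate on the left (preserved) table of a LEFT OUTER JOIN only restricts which rows match when it sits in the ON clause, whereas in the WHERE clause it removes rows from the result. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module as a stand-in for Hive, and the tables t1 and t2 and their sample rows are made up. Hive should be checked separately on the real tables.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (var1 INTEGER, ds TEXT, org_id INTEGER);
CREATE TABLE t2 (var1 INTEGER, var2 TEXT);
-- Hypothetical sample rows: only var1 = 1 satisfies both conditions on a.
INSERT INTO t1 VALUES (1, '2014-12-10', NULL);
INSERT INTO t1 VALUES (2, '2014-12-09', NULL);
INSERT INTO t1 VALUES (3, '2014-12-10', 7);
INSERT INTO t2 VALUES (1, 'x');
INSERT INTO t2 VALUES (2, 'y');
INSERT INTO t2 VALUES (3, 'z');
""")

# Conditions on a inside the ON clause: every row of a is kept,
# but only rows that pass the conditions can match a row of b.
on_rows = conn.execute("""
SELECT a.var1, b.var2
FROM t1 a
LEFT OUTER JOIN t2 b
  ON a.var1 = b.var1
 AND a.ds = '2014-12-10'
 AND a.org_id IS NULL
ORDER BY a.var1
""").fetchall()

# Same conditions in the WHERE clause: rows of a that fail them are dropped.
where_rows = conn.execute("""
SELECT a.var1, b.var2
FROM t1 a
LEFT OUTER JOIN t2 b
  ON a.var1 = b.var1
WHERE a.ds = '2014-12-10'
  AND a.org_id IS NULL
ORDER BY a.var1
""").fetchall()

print(on_rows)     # [(1, 'x'), (2, None), (3, None)]
print(where_rows)  # [(1, 'x')]
```

If the intent is to load only the rows of T1 for that ds with a NULL org_id, the WHERE form is the one that expresses it; if every row of T1 should be kept, the two forms are not interchangeable.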
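Answer 1 (score: -1)
Here is what I found: there is a concept called predicate pushdown (I'm not 100% sure, but it is enabled by default in newer versions of Hive). If I enable it with set hive.optimize.ppd = true; then I get the same performance in both cases:
Case I: conditions specified in every join. Results:
- 24254 Rows loaded to tab1
- MapReduce Jobs Launched:
- Job 0: Map: 16 Reduce: 4 Cumulative CPU: 802.6 sec HDFS Read: 3020743758 HDFS Write: 900057 SUCCESS
- Job 1: Map: 1 Cumulative CPU: 4.93 sec HDFS Read: 965541 HDFS Write: 898430 SUCCESS
- Total MapReduce CPU Time Spent: 13 minutes 27 seconds 530 msec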
INSERT OVERWRITE TABLE tab1
blah blah ..
FROM T1 a
LEFT OUTER JOIN T2 b
ON a.var1 = b.var1
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
ON a.var1 = c.var1
AND c.ds = '2014-12-10'
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
LEFT OUTER JOIN T4 d
ON a.var1 = d.var1
AND d.ds = '2014-12-10'
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
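Case II: conditions specified only in the first join.
Results:
- 24254
- Job 0: Map: 16 Reduce: 4 Cumulative CPU: 803.35 sec HDFS Read: 3020743758 HDFS Write: 900057 SUCCESS
- Job 1: Map: 1 Cumulative CPU: 3.75 sec HDFS Read: 965541 HDFS Write: 898429 SUCCESS
- Total MapReduce CPU Time Spent: 13 minutes 27 seconds 100 msec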
INSERT OVERWRITE TABLE tab1
blah blah ..
FROM T1 a
LEFT OUTER JOIN T2 b
ON a.var1 = b.var1
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
ON a.var1 = c.var1
AND c.ds = '2014-12-10'
LEFT OUTER JOIN T4 d
ON a.var1 = d.var1
AND d.ds = '2014-12-10'
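Equal row counts and similar CPU times do not by themselves show that the two variants return the same rows, so a row-by-row comparison could be a useful extra check. The sketch below is a rough illustration only, assuming SQLite (via Python's sqlite3 module) as a stand-in for Hive, with made-up tables t1, t2, t3 and a hypothetical helper run(). It shows one way the two cases can have the same count and still differ in content; whether that happens with the real tables depends on the data.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (var1 INTEGER, ds TEXT, org_id INTEGER);
CREATE TABLE t2 (var1 INTEGER, var2 TEXT);
CREATE TABLE t3 (var1 INTEGER, var3 TEXT, ds TEXT);
-- Hypothetical rows: var1 = 2 fails the "org_id IS NULL" condition.
INSERT INTO t1 VALUES (1, '2014-12-10', NULL);
INSERT INTO t1 VALUES (2, '2014-12-10', 7);
INSERT INTO t2 VALUES (1, 'b1');
INSERT INTO t2 VALUES (2, 'b2');
INSERT INTO t3 VALUES (1, 'c1', '2014-12-10');
INSERT INTO t3 VALUES (2, 'c2', '2014-12-10');
""")

# The conditions on a from the question.
A_FILTER = "AND a.ds = '2014-12-10' AND a.org_id IS NULL"

def run(repeat_filter):
    """Run the join; optionally repeat the a-conditions in the second join."""
    extra = A_FILTER if repeat_filter else ""
    sql = f"""
    SELECT a.var1, b.var2, c.var3
    FROM t1 a
    LEFT OUTER JOIN t2 b
      ON a.var1 = b.var1 {A_FILTER}
    LEFT OUTER JOIN t3 c
      ON a.var1 = c.var1 AND c.ds = '2014-12-10' {extra}
    ORDER BY a.var1
    """
    return conn.execute(sql).fetchall()

case_1 = run(repeat_filter=True)   # conditions in every join
case_2 = run(repeat_filter=False)  # conditions only in the first join

print(len(case_1), len(case_2))  # 2 2
print(case_1)  # [(1, 'b1', 'c1'), (2, None, None)]
print(case_2)  # [(1, 'b1', 'c1'), (2, None, 'c2')]
```

In Hive, a comparable check would be to write each variant to its own table and compare the contents, not just the "Rows loaded" line.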