Hive JOIN with a subquery takes forever

Time: 2015-06-10 15:53:10

Tags: hadoop hive

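I've been playing around with Hive recently, and most things have gone well. However, I'm stuck trying to convert rows like the first block below into the per-date counts shown in the second block:
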
2015-04-01   device1   traffic    other       start   
2015-04-01   device1   traffic    violation   deny 
2015-04-01   device1   traffic    violation   deny 
2015-04-02   device1   traffic    other       start   
2015-04-03   device1   traffic    other       start   
2015-04-03   device1   traffic    other       start   

2015-04-01   1       2
2015-04-02   1       
2015-04-03   2       

I tried the following query, but for some reason the reduce phase stalls at 96% no matter how long I wait.

SELECT pass.date, COUNT(pass.type), COUNT(deny.deny_type) FROM firewall_logs as pass
JOIN (
SELECT date, type as deny_type FROM firewall_logs
WHERE device = 'device1' 
AND date LIKE '2015-04-%'
AND type = 'traffic' AND subtype = 'violation' and status = 'deny' 
) deny ON ( pass.date = deny.date  )
WHERE pass.device = 'device1' 
AND pass.date LIKE '2015-04-%'
AND pass.type = 'traffic' AND pass.subtype = 'other' AND pass.status = 'start'
GROUP BY pass.date ORDER BY pass.date ;

All the MR2 logs show:

2015-06-11 01:54:04,206 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 9028000 rows for join key [2015-04-26]
2015-06-11 01:54:04,423 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 9128000 rows for join key [2015-04-26]
2015-06-11 01:54:04,638 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 9228000 rows for join key [2015-04-26]
2015-06-11 01:54:04,838 INFO [main] org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1

Does anyone know why?

1 answer:

Answer 0 (score: 1)

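I try to avoid self-joins in Hive like the plague. You can get the same result by counting per date and status, then collecting those counts into a map: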

add jar ./brickhouse-0.7.1.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select date
  , c_map['start'] starts
  , c_map['deny'] denies
from (
  select date
    , collect(status, c) c_map
  from (
    select date, status
      , count( subtype ) c
    from firewall_logs
    where device='device1' and type='traffic'
    group by date, status ) x
  group by date ) y
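
If you would rather not load a UDF jar, conditional aggregation is another way to avoid the self-join. The sketch below is only an illustration, not something I have run on Hive: it uses Python's built-in sqlite3 module and the six sample rows from the question so it can run without a cluster, and the `run` helper and the two query constants are names made up for this example. The SQL in `CONDITIONAL_COUNTS` is plain `SUM(CASE WHEN ...)`, which I would expect Hive to accept as well. `JOIN_FANOUT` is a reduced version of the question's join (an assumption-level simplification, keeping only the subtype and status filters) to show what joining on `date` alone does: every start row is paired with every deny row of the same date, and dates with no deny rows disappear.

```python
import sqlite3

# Sample rows copied from the question: (date, device, type, subtype, status).
ROWS = [
    ("2015-04-01", "device1", "traffic", "other", "start"),
    ("2015-04-01", "device1", "traffic", "violation", "deny"),
    ("2015-04-01", "device1", "traffic", "violation", "deny"),
    ("2015-04-02", "device1", "traffic", "other", "start"),
    ("2015-04-03", "device1", "traffic", "other", "start"),
    ("2015-04-03", "device1", "traffic", "other", "start"),
]

# One scan, no join: each row adds 1 to at most one of the two counters.
CONDITIONAL_COUNTS = """
SELECT date,
       SUM(CASE WHEN subtype = 'other' AND status = 'start'
                THEN 1 ELSE 0 END) AS starts,
       SUM(CASE WHEN subtype = 'violation' AND status = 'deny'
                THEN 1 ELSE 0 END) AS denies
FROM firewall_logs
WHERE device = 'device1' AND date LIKE '2015-04-%' AND type = 'traffic'
GROUP BY date
ORDER BY date
"""

# Reduced form of the question's self-join: count the joined rows per date.
JOIN_FANOUT = """
SELECT pass.date, COUNT(*) AS joined_rows
FROM firewall_logs AS pass
JOIN firewall_logs AS deny ON pass.date = deny.date
WHERE pass.subtype = 'other' AND pass.status = 'start'
  AND deny.subtype = 'violation' AND deny.status = 'deny'
GROUP BY pass.date
ORDER BY pass.date
"""


def run(query):
    """Load the sample rows into an in-memory table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE firewall_logs "
        "(date TEXT, device TEXT, type TEXT, subtype TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO firewall_logs VALUES (?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run(CONDITIONAL_COUNTS))
    print(run(JOIN_FANOUT))
```

On the sample data the join already returns two rows for 2015-04-01 (1 start row times 2 deny rows) and nothing for the other two dates. With millions of rows sharing one join key, as in the `CommonJoinOperator` log lines from the question, that per-date product is what the reducer has to work through, whereas the single-pass version never builds it. Note that the single-pass query reports 0 rather than an empty value for dates with no deny rows.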