Hive join or subquery confusion

Date: 2016-04-26 00:18:12

Tags: sql hadoop hive

(SELECT
    id,
    SUM(hits / ab) AS HAB
FROM batting
GROUP BY id
) b 

SELECT id, bmonth, bstate FROM master a

 WHERE bmonth >= 0 AND bstate is NOT NULL
 GROUP By bmonth,bstate

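So far I have this jumble of fragments, but I'm lost on how to form the join and continue from there. I don't know where to start. Should we use a join or a subquery? Please help.

See the schema below: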

CREATE EXTERNAL TABLE IF NOT EXISTS batting
    (id STRING, year INT, team STRING,
    league STRING, games INT, ab INT, runs INT, hits INT, doubles INT, triples INT, 
    homeruns INT, rbi INT, sb INT, cs INT, walks INT, strikeouts INT, ibb INT, 
    hbp INT, sh INT, sf INT, gidp INT) 
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LOCATION '/home/hduser/hivetest/batting';

CREATE EXTERNAL TABLE IF NOT EXISTS master
    (id STRING, byear INT, bmonth INT, bday INT, bcountry STRING, bstate STRING, 
    bcity STRING, dyear INT, dmonth INT, dday INT, dcountry STRING, dstate STRING, 
    dcity STRING, fname STRING, lname STRING, name STRING, weight INT, height INT, 
    bats STRING, throws STRING, debut STRING, finalgame STRING, retro STRING, 
    bbref STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/hduser/hivetest/master';

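2 answers:

Answer 0: (score: 1)

First, make sure that at least 3 players come from the same state and the same month. You have to get that set from the master table: count the ids for each state/month and keep only the results where count(id) >= 3.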

select bstate,bmonth from master
group by bstate,bmonth
having count(id) >=3 

Then you have to join the batting table with the set above by month and state, order by sum(hits)/sum(ab), and take the first row.

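-- a: state/month pairs with at least 3 players
-- b: master again, to recover the player ids in each pair
-- c: batting rows for those players
select a.bmonth,a.bstate,SUM(c.hits)/SUM(c.ab) hb
from (select bmonth,bstate from master
      group by bmonth,bstate
      having count(id) >=3) a
join master b on a.bstate=b.bstate and a.bmonth = b.bmonth
join batting c on b.id = c.id
group by a.bmonth,a.bstate
order by hb
limit 1;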
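
One thing to watch with the ordering step: as far as I know, Hive returns NULL when dividing by zero and sorts NULLs first in ascending order, so a state/month group with no at-bats at all could end up as the "first row". A minimal sketch of a guarded variant follows, using the column names from the schema in the question. The extra HAVING clause is my own assumption about what is wanted, not part of the original answer.

```sql
-- Sketch only: the join described above, plus a guard against groups with zero at-bats.
SELECT a.bmonth, a.bstate, SUM(c.hits) / SUM(c.ab) AS hb
FROM (SELECT bmonth, bstate
      FROM master
      GROUP BY bmonth, bstate
      HAVING COUNT(id) >= 3) a                  -- state/month pairs with at least 3 players
JOIN master b ON a.bstate = b.bstate AND a.bmonth = b.bmonth
JOIN batting c ON b.id = c.id
GROUP BY a.bmonth, a.bstate
HAVING SUM(c.ab) > 0                            -- assumed guard: skip groups whose ratio would be NULL
ORDER BY hb
LIMIT 1;
```

If the highest ratio is wanted instead of the lowest, change the ordering to ORDER BY hb DESC.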

Answer 1: (score: 0)

Here is the query:

select id, sum(hits)/sum(ab) as output
from (select m.id, b.ab, b.hits
      from master m join batting b on m.id = b.id
      where m.bmonth >= 0 AND m.bstate is NOT NULL) t  -- Hive requires an alias on a FROM subquery
group by id