Question

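Suppose I have two partitioned tables, customers and items, both partitioned by the country and state columns.

Is this the correct way to join these tables, given that I only want to retrieve the data for a specific country and state?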

select 
  customers.id, 
  customers.name, 
  items.name, 
  items.value
from
  customers
  join items
  on customers.id == items.customer_id
  and customers.country == 'USA'
  and customers.state == 'TX'
  and items.country == 'USA'
  and items.state == 'TX'

Or should these conditions go in the WHERE clause instead?

and customers.country == 'USA'
and customers.state == 'TX'
and items.country == 'USA'
and items.state == 'TX'

Answer 1

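For simple queries, Hive pushes the predicates down so that they are applied before the reduce phase, so in this case performance will be the same whether you put the conditions in the "on" clause or in the "where" clause. However, if you write other queries that compare fields across tables (table1.a < table2.b), Hive will perform the join and apply the where condition at the end (in the reducer phase), as most relational databases do.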
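As a rough illustration of the two cases, here is a sketch using the tables from the question. The first statement filters each table only on its own partition columns, which Hive can apply while scanning. The second adds a comparison across the two tables, which can only be evaluated once rows from both sides have been matched. The column customers.credit_limit is hypothetical and exists only to make the comparison concrete. EXPLAIN prints the plan, so you can check where each filter ends up on your own Hive version.

```sql
-- Case 1: single-table filters only. These can be applied during the
-- table scans, so the join itself sees only the USA/TX rows.
EXPLAIN
SELECT customers.id, customers.name, items.name AS item_name, items.value
FROM customers
JOIN items
  ON customers.id = items.customer_id
WHERE customers.country = 'USA'
  AND customers.state = 'TX'
  AND items.country = 'USA'
  AND items.state = 'TX';

-- Case 2: the last condition compares columns from both tables, so it
-- needs a joined row before it can be evaluated.
-- NOTE: credit_limit is a hypothetical column, not part of the question.
EXPLAIN
SELECT customers.id, customers.name, items.name AS item_name, items.value
FROM customers
JOIN items
  ON customers.id = items.customer_id
WHERE customers.country = 'USA'
  AND customers.state = 'TX'
  AND items.country = 'USA'
  AND items.state = 'TX'
  AND items.value > customers.credit_limit;
```

Note that this equivalence is about inner joins, which is what the question uses.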

Answer 2

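Yes, we can join partitioned tables. A partition is nothing more than a folder structure: partitioning divides a table into related parts based on the values of particular columns, such as date, state, and so on. For example, I have the partitions below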

show partitions table_name1 
year=2016/month=12/day=1/part=10

show partitions table_name2 
year=2016/month=12/day=1/part=1
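For context, a table that reports partitions like these could have been created roughly as follows. This is only a sketch: the column types are assumptions, and col1, col2 and col3 are simply the column names used in the queries in this answer.

```sql
-- Hypothetical DDL: the column types are assumed, not taken from the answer.
CREATE TABLE table_name1 (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
PARTITIONED BY (year INT, month INT, day INT, part INT);

-- Registers one partition. By default its data is stored in a
-- subdirectory of the table location named
--   year=2016/month=12/day=1/part=10
ALTER TABLE table_name1
  ADD PARTITION (year=2016, month=12, day=1, part=10);
```

The partition columns are not stored in the data files themselves. Their values come from the directory names, which is why a filter on them lets Hive skip whole directories.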

Now we can join the tables in either of the following ways

SELECT i.col1, c.col1
FROM (SELECT * FROM table_name1 WHERE year=2016 AND month=12 AND day=1) i
JOIN (SELECT * FROM table_name2 WHERE year=2016 AND month=12 AND day=1) c
ON i.col2 = c.col2
AND i.col3 = c.col3
GROUP BY i.col1, c.col1

OR

SELECT i.col1, c.col1
FROM table_name1 i
JOIN table_name2 c
ON i.col2 = c.col2
AND i.col3 = c.col3
AND i.year=2016 AND i.month=12 AND i.day=1
AND c.year=2016 AND c.month=12 AND c.day=1
GROUP BY i.col1, c.col1
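If you want to confirm that only the intended partitions are read, one option is EXPLAIN DEPENDENCY, which reports the tables and partitions a query takes as input. The sketch below assumes the aliases i and c for table_name1 and table_name2 and puts the partition filters in a WHERE clause. The same check works for either form of the join.

```sql
-- Lists the input tables and input partitions for the query without
-- running it. Only the year=2016/month=12/day=1 partitions of each
-- table should appear if partition pruning is taking effect.
EXPLAIN DEPENDENCY
SELECT i.col1, c.col1
FROM table_name1 i
JOIN table_name2 c
  ON i.col2 = c.col2
 AND i.col3 = c.col3
WHERE i.year = 2016 AND i.month = 12 AND i.day = 1
  AND c.year = 2016 AND c.month = 12 AND c.day = 1;
```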

Joining partitioned tables in Hive
