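HiveQL: return only the values that are NOT LIKE any value in another table

Date: 2015-02-24 05:33:28

Tags: hadoop hive cloudera hiveql impala

I am trying to find all values in the hosts table that do not partially match any value in my maildomains table.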

hosts
+-------------------+-------+
|       host        | score |
+-------------------+-------+
| www.gmail.com     |   489 |
| www.hotmail.com   |   653 |
| www.google.com    |   411 |
| w3.hotmail.ca     |   223 |
| stackexchange.com |   950 |
+-------------------+-------+
maildomains 
+---------------+
| email         |
+---------------+
| gmail         |
| hotmail       |
| outlook       |
| mail          |
+---------------+

Specifically, I want to SELECT * FROM hosts where hosts.host is NOT LIKE '%<any value in maildomains.email>%'.

Desired output:
+-------------------+-------+
|       host        | score |
+-------------------+-------+
| www.google.com    |   411 |
| stackexchange.com |   950 |
+-------------------+-------+

Here is how I think it should work logically:

SELECT h.*, m.email FROM (SELECT * FROM hosts WHERE score > 100) h
LEFT OUTER JOIN maildomains m ON (h.host LIKE CONCAT('%.',m.email,'%'))
WHERE m.email IS NULL

This fails with Error 10017: Both left and right aliases encountered in JOIN '%'

I also managed to get a similar query to run without errors as a CROSS JOIN, but it produces the wrong results:

SELECT h.*, m.email FROM (SELECT * FROM hosts WHERE score > 100) h
CROSS JOIN maildomains m
WHERE h.host NOT LIKE CONCAT('%.',m.email,'%')

+-------------------+---------+---------+
|      h.host       | h.score | m.email |
+-------------------+---------+---------+
| www.gmail.com     |     489 | hotmail |
| www.gmail.com     |     489 | outlook |
| www.gmail.com     |     489 | mail    |
| www.hotmail.com   |     653 | gmail   |
| www.hotmail.com   |     653 | outlook |
| www.hotmail.com   |     653 | mail    |
| www.google.com    |     411 | gmail   |
| www.google.com    |     411 | hotmail |
| www.google.com    |     411 | outlook |
| www.google.com    |     411 | mail    |
| w3.hotmail.ca     |     223 | gmail   |
| w3.hotmail.ca     |     223 | outlook |
| w3.hotmail.ca     |     223 | mail    |
| stackexchange.com |     950 | gmail   |
| stackexchange.com |     950 | hotmail |
| stackexchange.com |     950 | outlook |
| stackexchange.com |     950 | mail    |
+-------------------+---------+---------+

I would appreciate any guidance.

2 answers:

Answer 0 (score: 1)

You can do it like this:

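-- Strip the known prefixes and suffixes, then equi-join on what is left.
-- Dots are escaped and the patterns anchored so only literal matches are removed.
select h.host
from hosts h
left outer join maildomains m
  on (regexp_replace(regexp_replace(regexp_replace(regexp_replace(h.host,'^www\\.',''),'\\.com$',''),'\\.ca$',''),'^w3\\.','') = m.email)
where m.email is NULL;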
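
The join above only works if every prefix and suffix is known in advance. A minimal sketch of an alternative that keeps the partial-match test from the question, assuming the same hosts and maildomains tables: since Hive of that era appears to accept only equality conditions in ON (which is what Error 10017 complains about), the LIKE is moved out of the join and into an aggregate. Each host is paired with every mail domain, the matches are counted per host, and only hosts with zero matches are kept.

```sql
-- Sketch only: table and column names are taken from the question.
SELECT h.host, h.score
FROM hosts h
CROSS JOIN maildomains m
WHERE h.score > 100
GROUP BY h.host, h.score
-- Keep a host only if none of the mail domains matched it.
HAVING SUM(CASE WHEN h.host LIKE CONCAT('%.', m.email, '%') THEN 1 ELSE 0 END) = 0;
```

On the sample data this should leave www.google.com and stackexchange.com. Two caveats: the cross join compares every host with every domain, so it is probably only reasonable while maildomains stays small, and an empty maildomains table would return no hosts at all because the cross join then produces no rows.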

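Answer 1 (score: 0)

If your Hive version is 0.13 or newer, you can use a subquery in the WHERE clause to filter rows from the hosts table. The following is a more general approach that does not require you to enumerate every top-level domain you might find in your data: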

SELECT host, score
FROM hosts
WHERE
  regexp_extract(hosts.host, "(?:.*?\\.)?([^.]+)\\.[^.]+", 1) NOT IN
    (SELECT email FROM maildomains);

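This approach uses regexp_extract to isolate the domain part of the host that precedes the TLD, then checks whether that domain appears in the result of the subquery on the maildomains table.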
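
For Hive versions older than 0.13, or when maildomains.email may contain NULLs (under standard SQL semantics a NOT IN over a set containing NULL matches nothing), the same extraction can be combined with an equality join. This is a sketch under those assumptions, not a tested drop-in: the column alias host_domain is introduced here purely for illustration, and the regular expression is the one from the query above.

```sql
-- Sketch only: host_domain is a hypothetical alias, not part of the original schema.
SELECT h.host, h.score
FROM (
  SELECT host, score,
         regexp_extract(host, "(?:.*?\\.)?([^.]+)\\.[^.]+", 1) AS host_domain
  FROM hosts
) h
LEFT OUTER JOIN maildomains m ON (h.host_domain = m.email)
-- Hosts with no matching maildomains row come back with m.email set to NULL.
WHERE m.email IS NULL;
```

Because the ON clause is now a plain equality between one column from each side, it avoids the restriction that caused Error 10017 in the question.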