Hive query drops NULL values with a "NOT column = value" WHERE clause

Time: 2017-03-30 14:19:10

Tags: sql hadoop hive apache-spark-sql bigdata

Sample data in table1:

year month day utmsource
2017 03    26  NULL
2017 03    27  NULL
2017 03    27  facebook
2017 03    27  newsletter
2017 03    27  banner
2017 03    27  facebook    

Expected selection:

year month day utmsource
2017 03    27  NULL
2017 03    27  newsletter
2017 03    27  banner 

My Hive queries:

-- result = 0, it did not include the NULL utmsource record
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource="facebook"

-- result = 1 the NULL utmsource record is included
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND (utmsource IS NULL OR NOT utmsource="facebook")

-- also returns 0, the NULL utmsource record is not included
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

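Questions:

  1. Can someone explain this behavior?
  2. Can I change a setting so that I get the result of query 2 without adding the extra OR to my query? => that is, so that "not equals" includes NULL values in the result

1 Answer:

Answer 0: (score: 2)

What you want is a NULL-safe equality (or inequality) operator. ANSI SQL has an operator for this called IS DISTINCT FROM. Hive appears to use the MySQL version, <=>. So you can do: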

SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

This operator is described in the documentation.
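
As for the first question, the usual explanation is SQL's three-valued logic: comparing NULL with = yields NULL rather than FALSE, NOT NULL is still NULL, and a WHERE clause keeps a row only when its predicate is TRUE. The following is a minimal sketch that evaluates the three predicates against a NULL value directly. It assumes a Hive version that allows SELECT without a FROM clause (0.13 or later), and the column aliases are made up for illustration.

```sql
-- Evaluate each predicate from the question against a NULL utmsource.
-- The trailing comments give the value expected under standard
-- three-valued logic; they are not output captured from a cluster.
SELECT
  NOT (CAST(NULL AS STRING) = 'facebook')           AS plain_not_equal, -- NULL: row filtered out
  (CAST(NULL AS STRING) IS NULL
     OR NOT (CAST(NULL AS STRING) = 'facebook'))    AS with_is_null,    -- TRUE: row kept
  NOT (CAST(NULL AS STRING) <=> 'facebook')         AS null_safe;       -- TRUE: row kept
```

The explicit parentheses around each comparison are a deliberate choice here: they make the result independent of how a given engine ranks NOT against the comparison operators.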

I should also point out that you might find this a simpler formulation of the SELECT:

SELECT (COUNT(*) - COUNT(utmsource)) as amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

Overall, though, this seems simplest:

SELECT COUNT(*) AS amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND utmsource IS NULL;

The comparison to 'facebook' is unnecessary, because a row whose utmsource is NULL can never equal 'facebook'.
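
The row selection shown in the question (as opposed to the NULL count) can be tried out without the real table. This is a minimal sketch, assuming Hive 0.13 or later for the WITH clause and the FROM-less SELECT. The name sample_rows is made up for illustration and holds the utmsource values of the question's 2017-03-27 rows.

```sql
-- Inline copy of the question's utmsource values for 2017-03-27.
-- sample_rows is a hypothetical name, not a table from the question.
WITH sample_rows AS (
  SELECT CAST(NULL AS STRING) AS utmsource
  UNION ALL SELECT 'facebook'
  UNION ALL SELECT 'newsletter'
  UNION ALL SELECT 'banner'
  UNION ALL SELECT 'facebook'
)
-- With a NULL-safe comparison, the NULL, newsletter and banner rows
-- should pass the filter and both facebook rows should be dropped.
SELECT utmsource
FROM sample_rows
WHERE NOT (utmsource <=> 'facebook');
```

Swapping the last line for WHERE NOT (utmsource = 'facebook') should make the NULL row disappear, which is the behavior described in the question.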