I have the following code, which produces values I cannot explain:
select keyword,
count(distinct foo) as foos,
count(distinct bar) as bars,
count(*) as events,
length(keyword) as kwd_len,
if(keyword rlike '[A-Z]', 1, 0) as has_uc,
if(keyword rlike '[a-z]', 1, 0) as has_lc,
if(keyword rlike '[\\W ]', 1, 0) as has_nw
from all_my_data
group by keyword
where has_uc > 1 or has_lc > 1 or has_nw > 1;
The printout (redirected to a file) looks like this:
keyword foos bars events kwd_len has_uc has_lc has_nw
O Exterminador do Futuro - A S�rie Machinima� NULL NULL 1 1 14 81 1
NULL 1 1 3 21 0 1
credit offers NULL 1 1 1 77 0 1
muzikály NULL 1 3 3 9 0 1
NULL NULL 1 1 1 20 1
show NULL 1 1 3 5 0 1
(brooks , NULL 1 1 1 31 0 1
opera NULL 1 1 1 6 0 1
��ͯ NULL 1 1 1 7 0 0
l NULL NULL 1 1 1 5 0
methyl- NULL 1 1 2 24 0 1
Apartamento Planta baja престиж NULL 1 1 1 32 1 1
festivaly NULL 1 1 1 10 0 1
fiAEBEECACACACACACACACACACACAAA NULL 1 1 1 64 1 1
,B NULL 1 1 1 3 1 0
O Exterminador do Futuro - A S�rie Machinima� NULL NULL 1 1 7 87 1
concerts NULL 1 1 1 9 0 1
tanec NULL 1 3 5 6 0 1
Weekend Waiting Staff / Team Member ?€“ Central London ? NULL 1 1 1 77 1 1
ìùuL NULL 1 1 2 37 1 1
O Exterminador do Futuro - A S�rie Machinima� NULL NULL NULL NULL 1 1 2
MZ� �� � @ NULL NULL 1 1 2 59 1
error http://www.‰¼ NULL 1 1 1 50 0 1
2@ NULL 1 2 2 11 1 0
smc NULL NULL 1 1 2 18 0
[■ ▶ ▮▮] ® Mike Candys feat. David Deen - People Hold On (500K Exclusive Rework) ★ EXCLUSIVE! for club5485048 ★ [track at NULL 1 1 2 141 1 1
NULL 1 2 2 9 0 0
acting on a decision to buy marijuana illegally falls under the NULL 1 1 1 165 0 1
O Exterminador do Futuro - A S�rie Machinima� NULL NULL NULL 1 1 1 99
X NULL 1 1 2 65 1 1
�͠ NULL 1 1 1 5 0 0
NULL NULL 1 1 1 31 0
da???? �? NULL NULL 1 1 2 75 1
koncerty NULL 1 2 4 9 0 1
�͍ NULL NULL 1 1 1 15 1
O Exterminador do Futuro - A S�rie Machinima� NULL NULL NULL NULL 1 1 4
awk '{printf " NULL 1 1 1 38 0 1
ΒETA: NULL 1 1 2 93 1 0
�m NULL NULL 1 1 1 8 0
lsɬLjLj- Yahoo NULL 1 1 2 16 1 1
brave frontier %H2'2 NULL 1 1 1 23 1 1
smc NULL NULL 1 1 2 18 0
O Exterminador do Futuro - A S�rie Machinima DVDRip Rm� NULL 1 1 14 72 1 1
O Exterminador do Futuro - A S�rie Machinima� NULL NULL NULL 1 1 2 91
$A. ?8G .> NULL NULL 1 1 1 30 1
�͠ NULL 1 1 1 5 0 0
NULL 1 1 4 2 0 1
sport NULL 1 2 3 6 0 1
http://www.‰¼ NULL 1 1 1 44 0 1
ŠUN NULL 1 1 2 4 1 0
NULL 1 2 2 28 0 0
činohra NULL 1 3 7 8 0 1
Motorcraft MERCON NULL 1 1 6 21 1 1
O Exterminador do Futuro - A S�rie Machinima DVDRip Rm� NULL 1 1 7 76 1 1
NULL 1 1 1 8 1 1
� NULL 1 1 2 45 0 1
juegos frip NULL 1 1 1 13 0 1
NULL 1 1 1 11 0 1
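My questions are:
Where do the non-[01] values of has_uc and has_lc come from?
Why are there several rows for the same keyword (the empty one, Exterminador do Futuro, etc., confirmed by direct inspection of /misc/hdfs/user/hive/warehouse/bad_keywords/000000_0) despite group by keyword?
Where do the NULL values of foos and bars come from (how can count return NULL)?

Answer 0 (score: 0)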
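I don't know HiveQL, but from a general SQL perspective, the "group by keyword" in your first query looks like it is implicitly summing all the has_XX flags for each keyword, when all you want is a distinct list of keywords and their attributes.
Try removing the "group by keyword" clause from the first query and using this instead: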
create table t1 as
select distinct keyword,
length(keyword) as kwd_len,
if(keyword rlike '[A-Z]', 1, 0) as has_uc,
if(keyword rlike '[a-z]', 1, 0) as has_lc,
if(keyword rlike '[\\W ]', 1, 0) as has_nw
from all_my_data
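
As a quick sanity check, here is a minimal sketch, assuming the table t1 from the statement above has been created with the has_uc, has_lc and has_nw columns (the alias n is only an illustrative name). It counts the keywords for each combination of flag values, so any value other than 0 or 1 would show up immediately:

```sql
-- Assumes t1 exists as created above; n is a hypothetical alias.
-- Each output row is one combination of flag values and how many keywords have it.
select has_uc,
       has_lc,
       has_nw,
       count(*) as n
from t1
group by has_uc, has_lc, has_nw;
```

If every flag in the result is 0 or 1, the flags are being computed per keyword as intended; if not, the problem probably lies somewhere other than the grouping.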