How can I get anything other than NULL/0/1 from if(...,1,0)?

Time: 2014-02-26 19:17:25

Tags: sql hive hiveql

I have the following code, which produces values I cannot explain:

select keyword,
       count(distinct foo) as foos,
       count(distinct bar) as bars,
       count(*) as events,
       length(keyword) as kwd_len,
       if(keyword rlike '[A-Z]', 1, 0) as has_uc,
       if(keyword rlike '[a-z]', 1, 0) as has_lc,
       if(keyword rlike '[\\W ]', 1, 0) as has_nw
from all_my_data
group by keyword
where has_uc > 1 or has_lc > 1 or has_nw > 1;

The printout (redirected to a file) looks like this:

keyword foos    bars    events  kwd_len has_uc  has_lc  has_nw
O Exterminador do Futuro - A S�rie Machinima�   NULL    NULL    1   1   14  81  1
        NULL    1   1   3   21  0   1
credit offers   NULL    1   1   1   77  0   1
muzikály    NULL    1   3   3   9   0   1
        NULL    NULL    1   1   1   20  1
show    NULL    1   1   3   5   0   1
(brooks ,   NULL    1   1   1   31  0   1
opera   NULL    1   1   1   6   0   1
��ͯ NULL    1   1   1   7   0   0
l   NULL    NULL    1   1   1   5   0
methyl- NULL    1   1   2   24  0   1
Apartamento Planta baja престиж NULL    1   1   1   32  1   1
festivaly   NULL    1   1   1   10  0   1
fiAEBEECACACACACACACACACACACAAA     NULL    1   1   1   64  1   1
,B  NULL    1   1   1   3   1   0
O Exterminador do Futuro - A S�rie Machinima�   NULL    NULL    1   1   7   87  1
concerts    NULL    1   1   1   9   0   1
tanec   NULL    1   3   5   6   0   1
Weekend Waiting Staff / Team Member ?€“ Central London ?    NULL    1   1   1   77  1   1
ìùuL    NULL    1   1   2   37  1   1
O Exterminador do Futuro - A S�rie Machinima�   NULL    NULL    NULL    NULL    1   1   2
MZ�   �� � @    NULL    NULL    1   1   2   59  1
error http://www.‰¼ NULL    1   1   1   50  0   1
2@  NULL    1   2   2   11  1   0
smc     NULL    NULL    1   1   2   18  0
[■ ▶ ▮▮] ® Mike Candys feat. David Deen - People Hold On (500K Exclusive Rework) ★ EXCLUSIVE! for club5485048 ★ [track at   NULL    1   1   2   141 1   1
        NULL    1   2   2   9   0   0
acting on a decision to buy marijuana illegally falls under the     NULL    1   1   1   165 0   1
O Exterminador do Futuro - A S�rie Machinima�   NULL    NULL    NULL    1   1   1   99
X   NULL    1   1   2   65  1   1
�͠  NULL    1   1   1   5   0   0
        NULL    NULL    1   1   1   31  0
da???? �?   NULL    NULL    1   1   2   75  1
koncerty    NULL    1   2   4   9   0   1
�͍  NULL    NULL    1   1   1   15  1
O Exterminador do Futuro - A S�rie Machinima�   NULL    NULL    NULL    NULL    1   1   4
awk '{printf "  NULL    1   1   1   38  0   1
ΒETA:   NULL    1   1   2   93  1   0
�m  NULL    NULL    1   1   1   8   0
lsɬLjLj- Yahoo    NULL    1   1   2   16  1   1
brave frontier %H2'2    NULL    1   1   1   23  1   1
smc     NULL    NULL    1   1   2   18  0
O Exterminador do Futuro - A S�rie Machinima DVDRip Rm� NULL    1   1   14  72  1   1
O Exterminador do Futuro - A S�rie Machinima�   NULL    NULL    NULL    1   1   2   91
$A. ?8G .>  NULL    NULL    1   1   1   30  1
�͠  NULL    1   1   1   5   0   0
        NULL    1   1   4   2   0   1
sport   NULL    1   2   3   6   0   1
http://www.‰¼   NULL    1   1   1   44  0   1
ŠUN NULL    1   1   2   4   1   0
        NULL    1   2   2   28  0   0
činohra NULL    1   3   7   8   0   1
Motorcraft MERCON   NULL    1   1   6   21  1   1
O Exterminador do Futuro - A S�rie Machinima DVDRip Rm� NULL    1   1   7   76  1   1
        NULL    1   1   1   8   1   1
�   NULL    1   1   2   45  0   1
juegos frip     NULL    1   1   1   13  0   1
        NULL    1   1   1   11  0   1

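My questions are:

  1. Where do the non-[01] values of has_uc and has_lc come from?
  2. Why do I get duplicate keyword fields (the empty one, "Exterminador do Futuro", etc.; confirmed by direct inspection of /misc/hdfs/user/hive/warehouse/bad_keywords/000000_0) despite the group by keyword?
  3. Why do I get NULL values for foos and bars (how can count return NULL)?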

1 answer:

Answer 0: (score: 0)

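I don't know HiveQL, but from a general SQL point of view, the "group by keyword" in your first query looks like it is implicitly summing all the has_XX flags for each keyword, when all you want is a distinct list of keywords and their attributes.

Try removing the "group by keyword" clause from the first query and using this instead: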

create table t1 as
select distinct keyword, 
       length(keyword) as kwd_len,
       if(keyword rlike '[A-Z]', 1, 0) as has_uc,
       if(keyword rlike '[a-z]', 1, 0) as has_lc,
       if(keyword rlike '[\\W ]', 1, 0) as has_nw
from all_my_data;
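
As a quick sanity check on the result, here is a minimal sketch. It assumes the table t1 from the statement above was created successfully; the alias n is only illustrative. The first query lists every combination of flag values that actually occurs, and the second looks for keywords that still appear more than once.

```sql
-- Which flag combinations occur in t1, and how often?
-- Any value other than 0 or 1 here would mean the flags are still wrong.
select has_uc, has_lc, has_nw, count(*) as n
from t1
group by has_uc, has_lc, has_nw;

-- Keywords that appear in more than one row of t1.
select keyword, count(*) as n
from t1
group by keyword
having count(*) > 1;
```

If the first query returns only 0/1 combinations and the second returns no rows, the table itself is consistent, and any odd values seen in a text dump would have to come from somewhere else.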