Select part of a string from a column in Hive and get the count

Date: 2014-05-06 19:17:58

Tags: hive hiveql

I have a column 'platform' (shown as name3 below) whose rows look like this:

name3: "string1/string2/string3/s:1.2.1/ABCD/XYZ"

I have another column named 'name2'. My table looks like this:

 id        |    name2     |   name3
-----------+--------------+---------------------
 1         |      x1      | string1/string2/string3/s:1.2.1/ABCD/XYZ
 2         |      x1      | string1/string2/string3/S:2.2.1/ABCD/XYZ
 3         |      x2      | string5/string4/string3/s:1.1/ABCD/XYZ
 4         |      x3      | string1/string6/string7/m:0.2.2/ABCD/XYZ
 5         |      x2      | string1/string2/string3/S:2.2.0/ABCD/XYZ

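I want to get the number of events based on a substring of platform. For example:

 name3     |  X1   |  X2  |    X3
-----------+-------+------+----------
 string4   |       |  1   |
 string6   |       |      |    1

Or, if I want the counts based on 'android' or 'iOS', how would I do that?

 name3     |  X1   |  X2  |    X3
-----------+-------+------+----------
 string4   |       |  1   |
 string1   |  2    |  1   |    1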

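The query I use for counting is below. It returns the number of events correctly, but I can't figure out how to get the counts by substring.

select name2,
    count(1) AS total
from table1 where name2='x1' OR name2='x2' OR name2='x3'
group by name2;

Any suggestions?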

2 answers:

Answer 0 (score: 0)

Hope this helps...

Query:

 select a.platforms, a.event, count(1) as count
 from (
   select regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 1) as platforms,
          event
   from table1
 ) a
 group by a.platforms, a.event;

Output:

platforms       event   count
android         x1      2
android         x2      1    
android         x3      1
ios             x2      1
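
Because the value is '/'-delimited, the first component could also be read with Hive's built-in split function rather than a seven-group regular expression. The following is only a sketch under that assumption: it reuses table1, platform and event from the query above, and cnt is an illustrative alias. Unlike the anchored pattern, it does not require the value to contain a particular number of fields.

```sql
-- Sketch only: table1, platform and event are the names used in the query above.
-- split() returns an array of the '/'-separated fields; index 0 is the first one.
select a.platforms, a.event, count(1) as cnt
from (
  select split(platform, '/')[0] as platforms, event
  from table1
) a
group by a.platforms, a.event;
```

Changing the index (for example [1] for the second field) groups by a different part of the string.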

Answer 1 (score: 0)

First, I would split that string into a view with real columns. Something like:

create view my_view as select
id,
event,
regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 1) as os,
regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 2) as brand,
regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 3) as model,
regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 4) as lte,
regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 5) as abcd,
regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 6) as user,
regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 7) as xyz
from my_table;

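Querying that view is then much easier. You can also use it as an inner query. This kind of "count-where" query, where you need different counts in different columns, is a very common use case. The best way I know to do it is with the pattern:

sum(if( [condition] , 1, 0))

So for your example it would be: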

select os,
sum(if(event = 'x1', 1, 0)) as x1,
sum(if(event = 'x2', 1, 0)) as x2,
sum(if(event = 'x3', 1, 0)) as x3
from my_view
group by os;

Or:

select brand,
sum(if(event = 'x1', 1, 0)) as x1,
sum(if(event = 'x2', 1, 0)) as x2,
sum(if(event = 'x3', 1, 0)) as x3
from my_view
group by brand;

Here is the query above, but using the view definition as an inner query instead of an actual view:

select brand,
sum(if(event = 'x1', 1, 0)) as x1,
sum(if(event = 'x2', 1, 0)) as x2,
sum(if(event = 'x3', 1, 0)) as x3
from (
  select
  id,
  event,
  regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 1) as os,
  regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 2) as brand,
  regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 3) as model,
  regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 4) as lte,
  regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 5) as abcd,
  regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 6) as user,
  regexp_extract(platform, '^(.*)/(.*)/(.*)/(.*)/(.*)/(.*)/(.*)$', 7) as xyz
  from my_table
) t
group by brand;
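
The same conditional-count pattern can also be written with a standard CASE expression, and the inner query only needs to produce the columns the outer query actually uses. The sketch below is one possible variant, not the answer's own method: it assumes the same my_table, platform and event names, takes the second '/'-separated field with split, and carries over the three-value filter from the question's query.

```sql
-- Sketch only: my_table, platform and event are the names used in the queries above.
select brand,
  sum(case when event = 'x1' then 1 else 0 end) as x1,
  sum(case when event = 'x2' then 1 else 0 end) as x2,
  sum(case when event = 'x3' then 1 else 0 end) as x3
from (
  -- split() returns an array; index 1 is the second '/'-separated field
  select event, split(platform, '/')[1] as brand
  from my_table
  where event in ('x1', 'x2', 'x3')  -- same three values the question filters on
) t
group by brand;
```

Each output row is one distinct value of the chosen field, with one count column per event value, which is the shape of the result the question asks for.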