Question

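I'm trying to write a query that splits a column on each @ character, and I then want to count the occurrences of each resulting segment.

I managed to write the following query in Hive: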

SELECT 
distinct split (msg_txt,'\\@')[0] AS first_msg, count(*)
FROM table1
;

But this doesn't let me add a GROUP BY to get the counts. I tried doing it with a subquery:

SELECT first_msg, count(*)
FROM (
SELECT 
distinct split (msg_txt,'\\@')[0] AS first_msg
FROM table1
)
GROUP BY first_msg
;

But this gives me the following error:

Error while compiling statement: FAILED: ParseException line 7:6 missing EOF at 'BY' near 'GROUP'

So I'm not sure how to write this query.

I'd really appreciate it if anyone could advise.

Thanks in advance.

Answer 1

I think you just need a table alias:

SELECT first_msg, count(*)
FROM (SELECT distinct split(msg_txt,'\\@')[0] AS first_msg
      FROM table1
     ) t
GROUP BY first_msg;

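Hive requires table aliases:

The subquery has to be given a name because every table in a FROM clause must have a name.

In your version, Hive interprets GROUP as the name of the subquery, so the BY that follows makes no sense.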

As written, though, the query is a bit nonsensical, because you could just do:

SELECT distinct split(msg_txt,'\\@')[0] AS first_msg, 1 as cnt
FROM table1;

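The distinct in the subquery guarantees that every value is unique, so each count is 1. I assume your actual problem is a bit more complicated.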
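If the underlying goal is to count how many rows share each first segment, one possible sketch is to drop the distinct and group on the split expression directly. This is only a guess at the intent: table1 and msg_txt come from the question, while the cnt alias is an illustrative name. The expression is repeated in GROUP BY rather than referring to the first_msg alias, which is the conservative choice in Hive.

```sql
-- Sketch: count rows per first '@'-delimited segment.
-- Assumes the intent is "rows per first segment"; cnt is an illustrative alias.
SELECT split(msg_txt, '\\@')[0] AS first_msg,
       count(*)                 AS cnt
FROM table1
-- Repeat the expression instead of relying on the select alias.
GROUP BY split(msg_txt, '\\@')[0];
```

No subquery is involved here, so no table alias is needed.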

Answer 2

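Based on your requirements, I'm not sure why you're taking only the first element. Assuming you want to group by all of the elements after the first "@", a query that ignores the first element of the split should look something like this: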

select value, count(*) from (
select 
pos,value
from table1 lateral view posexplode(split (msg_txt,'\\@')) explodedcol as pos,value limit 10
) t where pos != 0 group by value
;

If you want to include all of the elements split on "@", just remove the pos != 0 condition from the where clause.
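To see what the pos filter acts on, it can help to look at the rows the lateral view produces before any grouping. The following is a minimal sketch on a made-up one-row input: the literal 'a@b@c' and the alias s are assumptions for illustration and stand in for table1. It also relies on Hive accepting a SELECT without a FROM clause, which needs a reasonably recent version.

```sql
-- Sketch: inspect the (pos, value) pairs produced by posexplode.
-- 'a@b@c' and the alias s are hypothetical stand-ins for table1.
SELECT pos, value
FROM (SELECT 'a@b@c' AS msg_txt) s
LATERAL VIEW posexplode(split(msg_txt, '\\@')) explodedcol AS pos, value;
-- Yields one row per segment: (0, 'a'), (1, 'b'), (2, 'c')
```

Positions start at 0, so pos != 0 drops only the text before the first "@". Note also that the limit 10 inside the subquery above caps the exploded rows before they are counted, so it would presumably be removed when computing counts over the full table.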

Regards

How to combine split and count in Hive
