Word count with Hive

Time: 2016-12-08 20:41:21

Tags: count hive hdfs hql

Suppose I have a table with the columns id and content:

id | content
________________________
1  | abc abr abc as abs
2  | abc arc cre arc
3  | agr ann agd agd agd 

For each id, I want a map of the words in content to the number of times each occurs, like this:

{"abc":2,"abr":1,"as":1, "abs":1}  # for id 1
{"abc":1,"arc":2,"cre":1}          # for id 2
{"agr":1,"agd":3,"ann":1}          # for id 3

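How can I do this with Hive?

1 answer:

Answer 0: (score: 1)

You need this library (Brickhouse, going by the JAR name in the query below). It is very easy to build.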

Query

ADD JAR /path/to/jar/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';

SELECT id
  , COLLECT(words, c) AS count_map
FROM (
  SELECT id
    , words
    , COUNT(*) AS c
  FROM (
    SELECT id, words
    FROM db.tbl
    LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words ) x
  GROUP BY id, words ) y
GROUP BY id;

Output

+----+---------------------------------+
|id  |count_map                        |
+----+---------------------------------+
|1   |{"as":1,"abs":1,"abc":2,"abr":1} |
+----+---------------------------------+
|2   |{"cre":1,"arc":2,"abc":1}        |
+----+---------------------------------+
|3   |{"ann":1,"agr":1,"agd":3}        |
+----+---------------------------------+
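If adding the Brickhouse JAR is not an option, a similar result can probably be obtained with Hive built-ins alone. The following is a minimal sketch, not part of the original answer. It assumes the same db.tbl table, a Hive version that provides COLLECT_LIST (0.13 or later), and words that contain neither ',' nor ':', since those characters are used as delimiters when the map is assembled. Note that STR_TO_MAP returns map<string,string>, so the counts come back as strings rather than integers.

```sql
-- Sketch only: built-in functions instead of Brickhouse's COLLECT.
-- Assumes db.tbl(id, content) as in the query above.
SELECT id
  -- Build "word:count" pairs, join them with commas, then parse into a map.
  , STR_TO_MAP(
      CONCAT_WS(',', COLLECT_LIST(CONCAT(words, ':', CAST(c AS STRING)))),
      ',', ':') AS count_map
FROM (
  -- Count how often each word occurs within each id.
  SELECT id
    , words
    , COUNT(*) AS c
  FROM (
    -- One row per (id, word) after splitting content on single spaces.
    SELECT id, words
    FROM db.tbl
    LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words ) x
  GROUP BY id, words ) y
GROUP BY id;
```

As with the Brickhouse version, the order of keys inside each map is not guaranteed.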