Hive - Count distinct comma-separated values in a column

Time: 2013-08-22 13:34:53

Tags: csv hive

One of my Hive tables looks like this:

ID     listOfcategories
1      ["a","b","b","a","c","d","d"]
2      ["a","a","a","c","c","c","c","e","e","e"]
3      ["a","b","c"]

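The number of comma-separated values varies from row to row. I want to query the number of distinct categories in each row/ID, so my output should look like this: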

ID     numDistCategories
1      4
2      3
3      3

1 Answer:

Answer 0: (score: 0)

You can use explode to output separate rows for each category, and then use COUNT(DISTINCT ...) to get the result you are looking for.

Something like this:

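-- Assumes listOfcategories is an ARRAY<STRING> column.
SELECT
    id,
    COUNT(DISTINCT cat) AS numDistCategories
FROM myTable
-- Hive does not allow a UDTF such as EXPLODE to be selected alongside
-- other columns, so it is applied through LATERAL VIEW, which emits
-- one row per array element together with the row's id.
LATERAL VIEW EXPLODE(listOfcategories) t AS cat
GROUP BY id;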
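
The question does not say whether listOfcategories is a real array or a plain string that merely looks like one. If it turns out to be a string such as `["a","b","b"]`, a variant along the following lines might work. This is only a sketch under that assumption: it strips the brackets and double quotes with REGEXP_REPLACE, splits on commas with SPLIT, and then explodes the resulting array through LATERAL VIEW. The alias `t` is arbitrary, and values that themselves contain commas or surrounding whitespace would need extra handling.

```sql
-- Assumption: listOfcategories is a STRING like '["a","b","b"]', not an ARRAY.
SELECT
    id,
    COUNT(DISTINCT cat) AS numDistCategories
FROM myTable
LATERAL VIEW EXPLODE(
    -- Remove [, ] and " characters, then split the remainder on commas.
    SPLIT(REGEXP_REPLACE(listOfcategories, '[\\[\\]"]', ''), ',')
) t AS cat
GROUP BY id;
```

Note that an empty list stored as `[]` would yield a single empty-string element after the split, so such rows may need to be filtered out separately.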

Hope that helps.