Impala - Error when counting multiple distinct values

Time: 2015-10-07 10:42:45

Tags: hadoop impala

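I am using CDH 5.4.4 (Cloudera edition). I have a CSV file at an HDFS location, and my requirement is to run real-time SQL queries in the Hadoop environment (OLTP).

So I decided to use Impala. I created a metastore table over the CSV file and then ran the query in the Impala editor (in the Hue application).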

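When I run the query below, I get this error:

"AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT City); deviating function: count(DISTINCT Country)".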

CSV File

OrderID,CustomerID,City,Country
Ord01,Cust01,Aachen,Germany
Ord02,Cust01,Albuquerque,USA
Ord03,Cust01,Aachen,Germany
Ord04,Cust02,Arhus,Denmark
Ord05,Cust02,Arhus,Denmark
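
For reference, a table over a file like this might be declared roughly as follows. This is only a sketch: the question does not show the actual DDL, so the HDFS path is a hypothetical placeholder and the all-STRING column types are an assumption.

```sql
-- Sketch only: the LOCATION path is a hypothetical placeholder,
-- and the STRING column types are assumed from the sample rows.
CREATE EXTERNAL TABLE CustomerOrders (
    OrderID    STRING,
    CustomerID STRING,
    City       STRING,
    Country    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hypothetical/customer_orders/';
```

With a plain text table like this, the header line of the CSV is read as an ordinary data row, so it would need to be removed from the file or filtered out (for example with WHERE OrderID <> 'OrderID') before counting.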

Problematic Query

Select CustomerID,Count(Distinct City),Count(Distinct Country) From CustomerOrders Group by CustomerID

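Question:

I cannot run an Impala query that contains more than one COUNT(DISTINCT ...) expression. Searching online, I found NDV() suggested as a workaround, but NDV() only returns an approximate count of distinct values, and I need exact distinct counts for multiple fields.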
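
For context, the NDV() workaround mentioned above would look roughly like the sketch below. The column aliases are illustrative names, not taken from the question.

```sql
-- Approximate distinct counts per customer.
-- NDV() returns an estimate, so the results may differ from COUNT(DISTINCT ...).
SELECT CustomerID,
       NDV(City)    AS approx_cities,     -- alias is illustrative
       NDV(Country) AS approx_countries   -- alias is illustrative
FROM CustomerOrders
GROUP BY CustomerID;
```

Unlike COUNT(DISTINCT ...), several NDV() calls on different columns can appear in the same query, which is why it is usually suggested first. It does not meet the exactness requirement here.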

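Expectation:

What is the best way to get exact distinct counts for multiple fields? Please modify the query above so that it works in Impala.

Note: This is not my original table; I reproduced it for this forum question.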

2 Answers:

Answer 0: (score: 2)

I ran into the same problem in Impala. Here is my workaround:

SELECT CustomerID
    ,sum(nr_of_cities)
    ,sum(nr_of_countries)
FROM (
    SELECT CustomerID
        ,Count(DISTINCT City) AS nr_of_cities
        ,0 AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID

    UNION ALL

    SELECT CustomerID
        ,0 AS nr_of_cities
        ,Count(DISTINCT Country) AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID
) AS aa
GROUP BY CustomerID

Answer 1: (score: 1)

I think this can be done more cleanly (untested):

WITH
countries AS
(
 SELECT CustomerID
       ,COUNT(DISTINCT Country) AS nr_of_countries
 FROM CustomerOrders
 GROUP BY 1
)
,
cities AS
(
 SELECT CustomerID
       ,COUNT(DISTINCT City) AS nr_of_cities
 FROM CustomerOrders
 GROUP BY 1
)
SELECT CustomerID
      ,nr_of_cities
      ,nr_of_countries
 FROM cities INNER JOIN countries USING (CustomerID)
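
Another possible variant, shown here only as an untested sketch, avoids COUNT(DISTINCT ...) altogether: deduplicate each (CustomerID, column) pair in a subquery first, then count the remaining rows. The subquery aliases (dc, dk, c, k) are illustrative names.

```sql
-- Exact distinct counts without COUNT(DISTINCT ...):
-- each inner SELECT DISTINCT removes duplicate pairs, the outer COUNT tallies them.
SELECT c.CustomerID,
       c.nr_of_cities,
       k.nr_of_countries
FROM (
    SELECT CustomerID, COUNT(City) AS nr_of_cities        -- COUNT(col) skips NULLs, like COUNT(DISTINCT col)
    FROM (SELECT DISTINCT CustomerID, City FROM CustomerOrders) AS dc
    GROUP BY CustomerID
) AS c
INNER JOIN (
    SELECT CustomerID, COUNT(Country) AS nr_of_countries
    FROM (SELECT DISTINCT CustomerID, Country FROM CustomerOrders) AS dk
    GROUP BY CustomerID
) AS k
ON c.CustomerID = k.CustomerID;
```

On the five sample rows from the question this should give 2 cities and 2 countries for Cust01, and 1 city and 1 country for Cust02. As with any inner join on CustomerID, rows whose CustomerID is NULL drop out of the result, whereas the UNION ALL approach keeps them as their own group.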