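I am using CDH 5.4.4 (Cloudera edition). I have a CSV file in an HDFS location, and my requirement is to run real-time SQL queries in the Hadoop environment (OLTP).
So I decided to use Impala. I created a metastore table over the CSV file and then ran queries in the Impala editor (in the Hue application).
When I execute the query below, I get this error:
"AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT City); deviating function: count(DISTINCT Country)".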
CSV File
OrderID,CustomerID,City,Country
Ord01,Cust01,Aachen,Germany
Ord02,Cust01,Albuquerque,USA
Ord03,Cust01,Aachen,Germany
Ord04,Cust02,Arhus,Denmark
Ord05,Cust02,Arhus,Denmark
Problematic Query
Select CustomerID,Count(Distinct City),Count(Distinct Country) From CustomerOrders Group by CustomerID
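Problem:
I cannot run an Impala query that contains more than one COUNT(DISTINCT ...) on different columns. Searching online, I found NDV() suggested as a workaround, but NDV() only returns an approximate count of distinct values, and I need an exact distinct count on multiple fields.
Expectation:
What is the best way to get exact distinct counts on multiple fields? Please modify the query above so that it works in Impala.
Note: this is not my original table; I made a simplified copy for this forum question.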
Answer 0 (score: 2)
I ran into the same problem in Impala. Here is my workaround:
SELECT CustomerID
,sum(nr_of_cities)
,sum(nr_of_countries)
FROM (
SELECT CustomerID
,Count(DISTINCT City) AS nr_of_cities
,0 AS nr_of_countries
FROM CustomerOrders
GROUP BY CustomerID
UNION ALL
SELECT CustomerID
,0 AS nr_of_cities
,Count(DISTINCT Country) AS nr_of_countries
FROM CustomerOrders
GROUP BY CustomerID
) AS aa
GROUP BY CustomerID
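
To sanity-check the logic of the UNION ALL workaround on the sample rows from the question, here is a minimal sketch that runs the same SQL against an in-memory SQLite database. SQLite is only a stand-in used for illustration (it is an assumption that the query behaves the same way in Impala, and SQLite does not reproduce the original AnalysisException); the helper name `run_union_all_check` is hypothetical.

```python
import sqlite3

# Sample rows copied from the CSV in the question.
ROWS = [
    ("Ord01", "Cust01", "Aachen", "Germany"),
    ("Ord02", "Cust01", "Albuquerque", "USA"),
    ("Ord03", "Cust01", "Aachen", "Germany"),
    ("Ord04", "Cust02", "Arhus", "Denmark"),
    ("Ord05", "Cust02", "Arhus", "Denmark"),
]

# Same shape as the workaround: each branch computes one exact distinct
# count and pads the other column with 0, then the outer query sums them.
UNION_ALL_QUERY = """
SELECT CustomerID, SUM(nr_of_cities), SUM(nr_of_countries)
FROM (
    SELECT CustomerID, COUNT(DISTINCT City) AS nr_of_cities, 0 AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID
    UNION ALL
    SELECT CustomerID, 0 AS nr_of_cities, COUNT(DISTINCT Country) AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID
) AS aa
GROUP BY CustomerID
ORDER BY CustomerID
"""


def run_union_all_check():
    """Load the sample rows into SQLite (stand-in engine) and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CustomerOrders "
        "(OrderID TEXT, CustomerID TEXT, City TEXT, Country TEXT)"
    )
    conn.executemany("INSERT INTO CustomerOrders VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(UNION_ALL_QUERY).fetchall()
    conn.close()
    return result


union_all_counts = run_union_all_check()
print(union_all_counts)  # [('Cust01', 2, 2), ('Cust02', 1, 1)]
```

Each branch of the UNION ALL contains only one COUNT(DISTINCT ...), which is what the error message asks for; the zeros make the outer SUM return the exact count from the other branch unchanged.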
Answer 1 (score: 1)
I think this can be done more cleanly (untested):
WITH
countries AS
(
SELECT CustomerID
,COUNT(DISTINCT Country) AS nr_of_countries
FROM CustomerOrders
GROUP BY 1
)
,
cities AS
(
SELECT CustomerID
,COUNT(DISTINCT City) AS nr_of_cities
FROM CustomerOrders
GROUP BY 1
)
SELECT CustomerID
,nr_of_cities
,nr_of_countries
FROM cities INNER JOIN countries USING (CustomerID)
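
Since this version is marked untested, here is a small sketch that runs a join-based query of the same shape on the question's sample rows, again using in-memory SQLite purely as a stand-in (an assumption, not Impala itself). The `countries` CTE here counts the Country column; the helper name `run_join_check` is hypothetical.

```python
import sqlite3

# Sample rows copied from the CSV in the question.
ROWS = [
    ("Ord01", "Cust01", "Aachen", "Germany"),
    ("Ord02", "Cust01", "Albuquerque", "USA"),
    ("Ord03", "Cust01", "Aachen", "Germany"),
    ("Ord04", "Cust02", "Arhus", "Denmark"),
    ("Ord05", "Cust02", "Arhus", "Denmark"),
]

# One CTE per distinct count, joined back together on CustomerID.
JOIN_QUERY = """
WITH countries AS (
    SELECT CustomerID, COUNT(DISTINCT Country) AS nr_of_countries
    FROM CustomerOrders
    GROUP BY 1
),
cities AS (
    SELECT CustomerID, COUNT(DISTINCT City) AS nr_of_cities
    FROM CustomerOrders
    GROUP BY 1
)
SELECT CustomerID, nr_of_cities, nr_of_countries
FROM cities INNER JOIN countries USING (CustomerID)
ORDER BY CustomerID
"""


def run_join_check():
    """Load the sample rows into SQLite (stand-in engine) and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CustomerOrders "
        "(OrderID TEXT, CustomerID TEXT, City TEXT, Country TEXT)"
    )
    conn.executemany("INSERT INTO CustomerOrders VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(JOIN_QUERY).fetchall()
    conn.close()
    return result


join_counts = run_join_check()
print(join_counts)  # [('Cust01', 2, 2), ('Cust02', 1, 1)]
```

One difference worth keeping in mind: an inner join on CustomerID drops any group whose CustomerID is NULL, because NULL never compares equal to NULL, whereas a UNION ALL followed by GROUP BY keeps such a group.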