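Logic checks and performance issues with COUNT(DISTINCT foo)

Date: 2014-02-27 17:25:40

Tags: google-bigquery

I need to run a regular and very expensive query, and unfortunately I have to join its results against a nearly identical query to get a ratio. As a result the query takes over 3 minutes to run. This is what I would like to do (assuming that avoiding the JOIN would speed up the query):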

SELECT
    date,
    meal,
    country,
    COUNT(DISTINCT person, WHERE UPPER(ingredient) CONTAINS "SUN BUTTER", 10000000) as total_sunbutter_meals_per_day,
    COUNT(DISTINCT person, 10000000) as total_meals,
    ROUND(100*total_sunbutter_meals_per_day/total_meals,1) as percentage_meals_sunbutter
FROM [project:dataset.menu]
GROUP BY date, meal, country

This is what I am forced to do instead:

SELECT
    total.date as date,
    total.meal as meal,
    total.country as country,
    total_sunbutter_meals_per_day,
    total_meals_per_day,
    ROUND(100*total_sunbutter_meals_per_day/total_meals_per_day,1) as percentage_meals_sunbutter
FROM
    (    
    SELECT
        date,
        meal,
        country,
        COUNT(DISTINCT person, 100000) as total_sunbutter_meals_per_day
    FROM [project:dataset.menu]
    WHERE    
        UPPER(ingredient) CONTAINS "SUN BUTTER"
    GROUP BY date, meal, country
    ) as sunbutter
JOIN
    (
    SELECT
        date,
        meal,
        country,
        COUNT(DISTINCT person, 100000) as total_meals_per_day
    FROM [project:dataset.menu]
    GROUP BY date, meal, country
    ) as total
ON total.date = sunbutter.date AND total.meal = sunbutter.meal AND total.country = sunbutter.country

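Three questions/issues:

  1. It seems there should be a way for BigQuery to do a COUNT(DISTINCT field) with some embedded conditional logic. Is there a way to avoid the join in the scenario above?
  2. COUNT DISTINCT with a value greater than 100,000 fails for me. I would like to be able to do a COUNT DISTINCT with 10,000,000. Is there a known performance issue with COUNT DISTINCT and large values? Is it being addressed?
  3. Are there plans to allow a field name declared/computed in a SELECT to be used in another expression of the same SELECT? In the example above, I would like to use the names of the results rather than repeat the formulas in the ROUND statement. (I.e. I would like to specify

    total_sunbutter_meals_per_day / total_meals rather than

    COUNT(DISTINCT person, WHERE UPPER(ingredient) CONTAINS "SUN BUTTER", 100000) / COUNT(DISTINCT person, 10000000) )

Thanks in advance for your help!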

1 answer:

Answer 0 (score: 2)

Question 1:

You can create an inner query that exposes two different fields to count, like this:

SELECT
  date,
  meal,
  country,
  COUNT(DISTINCT person) total_meals,
  COUNT(DISTINCT sunbutter_person) total_sunbutter_meals,
FROM
  (SELECT
     date,
     meal,
     country,
     person,
     IF(UPPER(ingredient) CONTAINS "SUN BUTTER", person, NULL) sunbutter_person
   FROM [project:dataset.menu])
GROUP BY date, meal, country
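
The approach relies on COUNT(DISTINCT ...) skipping NULLs, so mapping non-matching rows to NULL restricts the second count to sun butter eaters. Below is a minimal sketch for checking that behavior on toy data. It uses Python's built-in sqlite3 rather than BigQuery, so CASE and LIKE stand in for IF and CONTAINS, and the sample rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for [project:dataset.menu]; the rows are invented for illustration.
rows = [
    ("2014-02-27", "lunch", "US", "ann", "Sun Butter"),
    ("2014-02-27", "lunch", "US", "ann", "bread"),
    ("2014-02-27", "lunch", "US", "bob", "jam"),
    ("2014-02-27", "lunch", "US", "cat", "SUN BUTTER"),
    ("2014-02-27", "dinner", "US", "ann", "rice"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE menu (date TEXT, meal TEXT, country TEXT, person TEXT, ingredient TEXT)"
)
conn.executemany("INSERT INTO menu VALUES (?, ?, ?, ?, ?)", rows)

# The inner query maps non-matching rows to NULL; COUNT(DISTINCT ...) ignores NULLs.
query = """
SELECT date, meal, country,
       COUNT(DISTINCT person) AS total_meals,
       COUNT(DISTINCT sunbutter_person) AS total_sunbutter_meals
FROM (SELECT date, meal, country, person,
             CASE WHEN UPPER(ingredient) LIKE '%SUN BUTTER%' THEN person END
               AS sunbutter_person
      FROM menu)
GROUP BY date, meal, country
ORDER BY meal
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With these rows, lunch has three distinct people of whom two had sun butter, and dinner has one person and no sun butter.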

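Question 2:

In BigQuery, COUNT(DISTINCT) returns an approximate result once the number of distinct values exceeds a threshold. If you raise the threshold below which exact results are returned, performance suffers (and eventually the query fails), because a single worker has to keep track of all of those distinct values. For details, see "BigQuery COUNT(DISTINCT value) vs COUNT(value)".

If your need for exact results goes beyond what COUNT(DISTINCT) can scale to, an alternative is to use GROUP EACH BY and COUNT(*), which gives you an exact count of distinct elements in a scalable way.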
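
To see why grouping scales where a single exact COUNT(DISTINCT) does not, here is a conceptual sketch in plain Python. It is only a model of the idea, not BigQuery's actual implementation: the function names, the shard count, and the use of `zlib.crc32` as the routing hash are all assumptions made for illustration.

```python
import zlib


def exact_distinct_single_worker(values):
    # One worker must hold every distinct value in memory at once.
    return len(set(values))


def exact_distinct_sharded(values, num_shards=4):
    # Route each value to a shard by hash, so equal values always land together.
    shards = [set() for _ in range(num_shards)]
    for v in values:
        shards[zlib.crc32(v.encode("utf-8")) % num_shards].add(v)
    # Shards hold disjoint sets of values, so the per-shard counts simply add up.
    return sum(len(s) for s in shards)


# 5000 rows that mention only 1000 different (hypothetical) people.
people = [f"person_{i % 1000}" for i in range(5000)]
single = exact_distinct_single_worker(people)
sharded = exact_distinct_sharded(people)
print(single, sharded)
```

Both functions return the same exact count, but in the sharded version no single set has to hold all of the distinct values.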

Note that you would then need to handle the problem from question 1 slightly differently. Something like:

SELECT
  date,
  meal,
  country,
  COUNT(*) total_meals,
  SUM(sunbutter) total_sunbutter_meals,
FROM
  (SELECT
     date,
     meal,
     country,
     -- ingredient is not a grouping key, so collapse each person's rows to one 0/1 flag
     MAX(IF(UPPER(ingredient) CONTAINS "SUN BUTTER", 1, 0)) sunbutter,
   FROM [project:dataset.menu]
   GROUP EACH BY date, meal, country, person)
GROUP BY date, meal, country
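
The same two-level grouping can be tried on toy data. This is a sketch using Python's built-in sqlite3 rather than BigQuery, so a plain GROUP BY stands in for GROUP EACH BY and CASE/LIKE for IF/CONTAINS. The sample rows are invented, and wrapping the flag in MAX is an assumption of this sketch so that each person contributes exactly one row with a single 0/1 value.

```python
import sqlite3

# Toy stand-in for [project:dataset.menu]; the rows are invented for illustration.
rows = [
    ("2014-02-27", "lunch", "US", "ann", "Sun Butter"),
    ("2014-02-27", "lunch", "US", "ann", "bread"),
    ("2014-02-27", "lunch", "US", "bob", "jam"),
    ("2014-02-27", "lunch", "US", "cat", "SUN BUTTER"),
    ("2014-02-27", "dinner", "US", "ann", "rice"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE menu (date TEXT, meal TEXT, country TEXT, person TEXT, ingredient TEXT)"
)
conn.executemany("INSERT INTO menu VALUES (?, ?, ?, ?, ?)", rows)

# Inner query: one row per (date, meal, country, person) with a 0/1 sun butter flag.
# Outer query: counting those rows gives the exact number of distinct persons.
query = """
SELECT date, meal, country,
       COUNT(*) AS total_meals,
       SUM(sunbutter) AS total_sunbutter_meals
FROM (SELECT date, meal, country, person,
             MAX(CASE WHEN UPPER(ingredient) LIKE '%SUN BUTTER%' THEN 1 ELSE 0 END)
               AS sunbutter
      FROM menu
      GROUP BY date, meal, country, person)
GROUP BY date, meal, country
ORDER BY meal
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Note that ann has two lunch rows (sun butter and bread) yet is counted once, with her flag set to 1.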

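Question 3:

Currently you cannot reference other fields of the same SELECT statement, and we have no plans yet to add that capability. However, you can always wrap the query in another query.

Instead of: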

SELECT 17 AS a, a + 1 AS b

you can write:

SELECT a, a + 1 AS b FROM (SELECT 17 AS a)
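
A quick way to try the wrapping pattern locally is sketched below, using Python's built-in sqlite3 as a stand-in for BigQuery. The second query applies the same idea to the percentage from the question; the literal counts 3 and 2 are placeholder values, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The alias a is defined in the inner query and reused in the outer one.
wrapped = conn.execute("SELECT a, a + 1 AS b FROM (SELECT 17 AS a)").fetchone()
print(wrapped)

# Same pattern for the ratio: name the counts once, reuse the names outside.
pct_row = conn.execute(
    """
    SELECT total_meals,
           total_sunbutter_meals,
           ROUND(100.0 * total_sunbutter_meals / total_meals, 1)
             AS percentage_meals_sunbutter
    FROM (SELECT 3 AS total_meals, 2 AS total_sunbutter_meals)
    """
).fetchone()
print(pct_row)
```

In the real query, the inner SELECT would be one of the aggregating queries from the earlier answers, so each count expression is written only once.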