BigQuery: remove duplicates from an array

Time: 2020-03-26 10:58:30

Tags: sql google-bigquery

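I want to use BigQuery to group pages by their title in a single query and compute various metrics for each group. Because the title rules are not mutually exclusive, I am doing it like this: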

SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN
UNNEST([
    CASE WHEN (title LIKE '%game%') 
    THEN 'games_group' END, 
    CASE WHEN (title LIKE '%sport%') 
    THEN 'sports_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
GROUP BY title_group

Here is the result:

views       ...   title_group
3414469869  ... 
4355264     ...   games_group
1361074     ...   sports_group

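However, the figure 3414469869 for views of pages that belong to no group is wrong. When the title does not contain "game" (or "sport"), we get UNNEST([null, 'sports_group']) (or UNNEST(['games_group', null])), so views are still counted for the null group even though the page matched one of the rules. When the title contains neither "game" nor "sport", the views are even counted twice.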
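To make the over-counting concrete, here is a minimal sketch that runs the same query shape against four made-up rows. The `pages` CTE and its titles and view counts are hypothetical stand-ins for the real table, chosen only so the arithmetic can be checked by hand.

```sql
-- Hypothetical sample data, not taken from fh-bigquery.wikipedia_v3.
WITH pages AS (
  SELECT 'video_game' AS title, 10 AS views UNION ALL
  SELECT 'sport_news', 20 UNION ALL
  SELECT 'game_of_sport', 30 UNION ALL
  SELECT 'weather', 40
)
SELECT title_group, SUM(views) AS views
FROM pages
CROSS JOIN UNNEST([
  CASE WHEN title LIKE '%game%' THEN 'games_group' END,
  CASE WHEN title LIKE '%sport%' THEN 'sports_group' END
]) AS title_group
GROUP BY title_group
ORDER BY title_group;
-- NULL group:   10 (video_game) + 20 (sport_news) + 2 * 40 (weather) = 110
-- games_group:  10 + 30 = 40
-- sports_group: 20 + 30 = 50
```

Only the `weather` row (40 views) really belongs to no group, yet the NULL row reports 110.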

Is there a way to remove the duplicates from the array?

2 answers:

Answer 0: (score: 2)

How about adding another group?

SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` CROSS JOIN
     UNNEST([CASE WHEN title LIKE '%game%' THEN 'games_group' END, 
             CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
             CASE WHEN title NOT LIKE '%game%' AND title NOT LIKE '%sport%' THEN 'Neither' END
            ]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en' AND
      title_group IS NOT NULL
GROUP BY title_group;

Note: this does not take NULL titles into account. I don't know whether that matters.
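If NULL titles do matter, one possible variation is to treat a NULL title as an empty string in the third CASE, so that it lands in 'Neither' instead of dropping out (every comparison against NULL yields NULL, so all three CASE expressions would otherwise return NULL). This is only a sketch: the `pages` CTE is hypothetical sample data, and counting NULL titles as 'Neither' is an assumption about what is wanted.

```sql
-- Hypothetical sample data; the second row has a NULL title.
WITH pages AS (
  SELECT 'video_game' AS title, 10 AS views UNION ALL
  SELECT CAST(NULL AS STRING), 5
)
SELECT title_group, SUM(views) AS views
FROM pages
CROSS JOIN UNNEST([
  CASE WHEN title LIKE '%game%' THEN 'games_group' END,
  CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
  -- Assumption: a NULL title should be counted as 'Neither'.
  CASE WHEN IFNULL(title, '') NOT LIKE '%game%'
        AND IFNULL(title, '') NOT LIKE '%sport%' THEN 'Neither' END
]) AS title_group
WHERE title_group IS NOT NULL
GROUP BY title_group;
-- games_group: 10
-- Neither:      5
```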

However, I would express this with two columns:

SELECT (title LIKE '%game%') AS is_game,
       (title LIKE '%sport%') AS is_sport,
       SUM(views) AS views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en'
GROUP BY is_game, is_sport;

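This does not return the same rows as your query: games and sports are broken out into separate rows. But you can see the combinations.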

Edit:

Now that I think about it, you just want a LEFT JOIN:

SELECT g.title_group, SUM(pv.views) AS views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` pv LEFT JOIN
     (SELECT '%game%' AS pattern, 'games_group' AS title_group UNION ALL
      SELECT '%sport%', 'sports_group' AS title_group
     ) g
     ON pv.title LIKE g.pattern
WHERE DATE(pv.datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      pv.wiki = 'en'
GROUP BY g.title_group;

Answer 1: (score: 0)

The following is for BigQuery Standard SQL:

#standardSQL
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
UNNEST(
    CASE WHEN REGEXP_CONTAINS(title, r'game|sport') THEN 
      [
        CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
        CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END
      ]
      ELSE ['other']
    END
) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
AND   title_group IS NOT NULL
GROUP BY title_group
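A possible variation on the same idea strips the NULL entries out of the array itself with an ARRAY subquery, then falls back to ['other'] when nothing is left. This is a sketch under assumptions, not a tested drop-in: the `pages` and `tagged` CTEs and the `title_groups` column are hypothetical names, and the sample rows are made up.

```sql
-- Hypothetical sample data standing in for the pageviews table.
WITH pages AS (
  SELECT 'video_game' AS title, 10 AS views UNION ALL
  SELECT 'sport_news', 20 UNION ALL
  SELECT 'game_of_sport', 30 UNION ALL
  SELECT 'weather', 40
),
tagged AS (
  SELECT
    views,
    -- Build the candidate array, then keep only the non-NULL labels.
    ARRAY(
      SELECT g
      FROM UNNEST([
        IF(title LIKE '%game%', 'games_group', NULL),
        IF(title LIKE '%sport%', 'sports_group', NULL)
      ]) AS g
      WHERE g IS NOT NULL
    ) AS title_groups
  FROM pages
)
SELECT title_group, SUM(views) AS views
FROM tagged
CROSS JOIN UNNEST(
  -- Pages that matched no rule get a single 'other' label.
  IF(ARRAY_LENGTH(title_groups) = 0, ['other'], title_groups)
) AS title_group
GROUP BY title_group
ORDER BY title_group;
-- games_group:  10 + 30 = 40
-- other:        40
-- sports_group: 20 + 30 = 50
```

With this shape, adding a new rule means adding one more IF to the array; the fallback condition does not need to repeat every pattern.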