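I want to use BigQuery to group pages by their title in a single query and compute different metrics for each group. Because the title rules are not mutually exclusive, this is how I do it: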
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN
  UNNEST([
    CASE WHEN (title LIKE '%game%')
         THEN 'games_group' END,
    CASE WHEN (title LIKE '%sport%')
         THEN 'sports_group' END
  ]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
GROUP BY title_group
Here is the result:
views ... title_group
3414469869 ...
4355264 ... games_group
1361074 ... sports_group
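However, the number 3414469869 for page views that do not belong to any group is wrong. When the title does not contain 'game' (or 'sport'), we get UNNEST([null, "sports_group"]) (or UNNEST(["games_group", null])), so the views are still added to the null group. When the title contains neither 'game' nor 'sport', the views are even counted twice.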
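The overcounting described above can be reproduced with a minimal Python sketch. It is only a model of the query, not BigQuery itself: the rows in ROWS are hypothetical toy data, and the helper names title_groups and sum_views_by_group are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy rows (title, views); not taken from the Wikipedia dataset.
ROWS = [
    ("video game", 10),
    ("sport games", 5),
    ("water sport", 3),
    ("history", 100),
]

def title_groups(title):
    """Mirror the two CASE expressions: one array element per rule, None when the rule fails."""
    return [
        "games_group" if "game" in title else None,
        "sports_group" if "sport" in title else None,
    ]

def sum_views_by_group(rows):
    """Model CROSS JOIN UNNEST(...) followed by GROUP BY title_group."""
    totals = defaultdict(int)
    for title, views in rows:
        # Every array element produces a joined row, None elements included.
        for group in title_groups(title):
            totals[group] += views
    return dict(totals)

totals = sum_views_by_group(ROWS)
print(totals)
```

Only "history" (100 views) matches neither rule, yet the None group ends up with 213: the 100 views are counted twice, and the views of the two single-match titles are added on top.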
Is there a way to remove the duplicates from the array?
Answer 0 (score: 2)
How about adding another group?
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` CROSS JOIN
UNNEST([CASE WHEN title LIKE '%game%' THEN 'games_group' END,
CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
CASE WHEN title NOT LIKE '%game%' AND title NOT LIKE '%sport%' THEN 'Neither' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
wiki = 'en' AND
title_group IS NOT NULL
GROUP BY title_group;
Note: this does not take NULL titles into account. I don't know if that matters.
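To see what happens to a NULL title, here is a small Python sketch of SQL's three-valued logic, with None standing in for NULL. The helper names (sql_like_contains, sql_not, sql_and, groups_with_neither) are hypothetical; the sketch assumes the standard rules that LIKE on a NULL operand yields NULL and that CASE WHEN treats NULL like false.

```python
def sql_like_contains(title, needle):
    """Stand-in for title LIKE '%needle%': None (SQL NULL) when title is None."""
    if title is None:
        return None
    return needle in title

def sql_not(value):
    """SQL NOT: NULL stays NULL."""
    return None if value is None else not value

def sql_and(a, b):
    """SQL AND: false wins, otherwise NULL wins, otherwise true."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def groups_with_neither(title):
    """Model the three CASE expressions plus the title_group IS NOT NULL filter."""
    is_game = sql_like_contains(title, "game")
    is_sport = sql_like_contains(title, "sport")
    neither = sql_and(sql_not(is_game), sql_not(is_sport))
    candidates = [
        "games_group" if is_game else None,    # CASE WHEN treats NULL like false
        "sports_group" if is_sport else None,
        "Neither" if neither else None,
    ]
    return [g for g in candidates if g is not None]

print(groups_with_neither("history"))  # matches the third rule only
print(groups_with_neither(None))       # no rule matches a NULL title
```

Under these assumptions a row with a NULL title yields no group at all, so its views drop out of every total, including 'Neither'.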
However, I would express this with two columns:
SELECT (title LIKE '%game%') AS is_game,
       (title LIKE '%sport%') AS is_sport,
       SUM(views) AS views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en'
GROUP BY is_game, is_sport;
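This does not return the same rows as your query: the games and sports totals are split across rows, one row per combination of the two flags. But you can see the combinations.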
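As a rough illustration of the two-column output, the Python sketch below groups hypothetical toy rows by the pair (is_game, is_sport). The data and the function name views_by_flags are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy rows (title, views); not taken from the Wikipedia dataset.
ROWS = [
    ("video game", 10),
    ("sport games", 5),
    ("water sport", 3),
    ("history", 100),
]

def views_by_flags(rows):
    """Model GROUP BY is_game, is_sport: each row lands in exactly one bucket."""
    totals = defaultdict(int)
    for title, views in rows:
        key = ("game" in title, "sport" in title)  # (is_game, is_sport)
        totals[key] += views
    return dict(totals)

flag_totals = views_by_flags(ROWS)

# The per-group totals of the original query can be rebuilt by summing rows.
games_total = sum(v for (is_game, _), v in flag_totals.items() if is_game)
sports_total = sum(v for (_, is_sport), v in flag_totals.items() if is_sport)
print(flag_totals, games_total, sports_total)
```

Each row is counted exactly once, so the (False, False) bucket holds the correct "no group" total, and the overlapping totals are recovered by adding the relevant rows.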
EDIT:
Now that I think about it, you just want a LEFT JOIN:
SELECT g.title_group, SUM(pv.views) AS views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` pv LEFT JOIN
     (SELECT '%game%' AS pattern, 'games_group' AS title_group UNION ALL
      SELECT '%sport%', 'sports_group'
     ) g
     ON pv.title LIKE g.pattern
WHERE DATE(pv.datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      pv.wiki = 'en'
GROUP BY g.title_group;
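The reason a LEFT JOIN fits here can be sketched in Python: a row that matches several patterns is repeated once per match, while a row that matches none is kept exactly once with a NULL group. This is a model under assumptions, with hypothetical toy data and a made-up helper left_join_totals, and it uses substring tests in place of the LIKE patterns.

```python
from collections import defaultdict

# Hypothetical toy rows (title, views); not taken from the Wikipedia dataset.
ROWS = [
    ("video game", 10),
    ("sport games", 5),
    ("water sport", 3),
    ("history", 100),
]

# Stand-in for the derived table g: (substring to look for, title_group).
PATTERNS = [("game", "games_group"), ("sport", "sports_group")]

def left_join_totals(rows, patterns):
    """Model pv LEFT JOIN g ON pv.title LIKE g.pattern, then GROUP BY g.title_group."""
    totals = defaultdict(int)
    for title, views in rows:
        matches = [group for needle, group in patterns if needle in title]
        # LEFT JOIN semantics: an unmatched row survives once, with a None group.
        for group in matches or [None]:
            totals[group] += views
    return dict(totals)

join_totals = left_join_totals(ROWS, PATTERNS)
print(join_totals)
```

With the same toy data as before, the None group now holds only the 100 views of the title that matches neither pattern.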
Answer 1 (score: 0)
Below is for BigQuery Standard SQL:
#standardSQL
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
UNNEST(
CASE WHEN REGEXP_CONTAINS(title, r'game|sport') THEN
[
CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END
]
ELSE ['other']
END
) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
AND title_group IS NOT NULL
GROUP BY title_group