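I've been putting together some data for a university coursework assignment, and I'm looking for a way to optimise a query.
The dataset I'm using is the UK national police stop and search data, and I'm trying to find the correlation between ethnicity and the share of stop and searches.
I have a query that, for each combination of police force and ethnicity, finds the total number of searches, the percentage of that force's searches that involved that ethnicity, the national average percentage for that ethnicity, and the difference between the force's percentage and the national average (I know that's confusing).
Here is my current "working" query: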
SELECT c1.FORCE,
c1.ETHNICITY,
(SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE AND ETHNICITY = c1.ETHNICITY) AS num_searches,
(ROUND(((SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE AND ETHNICITY = c1.ETHNICITY) /
(SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE)::DECIMAL), 4) * 100) AS percentage_of_force,
(SELECT ROUND((COUNT(*) / 303565::DECIMAL) * 100, 4) FROM CRIMES WHERE ETHNICITY = c1.ETHNICITY GROUP BY ETHNICITY) AS national_average,
(SELECT (ROUND(((SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE AND ETHNICITY = c1.ETHNICITY) /
(SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE)::DECIMAL), 4) * 100) - (SELECT ROUND((COUNT(*) / 303565::DECIMAL) * 100, 4) FROM CRIMES WHERE ETHNICITY = c1.ETHNICITY GROUP BY ETHNICITY)) AS difference_from_average
FROM (SELECT * FROM CRIMES) AS c1
GROUP BY c1.FORCE, c1.ETHNICITY
ORDER BY c1.FORCE, c1.ETHNICITY;
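The problem I want to solve is the repeated use of the same subqueries throughout the SELECT list.
As you can see from the query above, difference_from_average is simply percentage_of_force minus national_average, but I can't work out a way to compute those values once and then reuse them elsewhere in the SELECT list. So my question is: how can I achieve that?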
Additional information
Sample input data
| date | ethnicity | force |
|------------|-----------|-----------------|
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | west-yorkshire |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | north-yorkshire |
| 2018-01-01 | White | west-yorkshire |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | Undefined | metropolitan |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | norfolk |
| 2018-01-01 | White | north-yorkshire |
| 2018-01-01 | White | northumbria |
| 2018-01-01 | White | west-yorkshire |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | Black | metropolitan |
Sample query results
| force | ethnicity | num_searches | percentage_of_force | national_average | difference_from_average |
|-------------------|-----------|--------------|---------------------|------------------|-------------------------|
| avon-and-somerset | Asian | 41 | 2.88 | 13.0641 | -10.1841 |
| avon-and-somerset | Black | 223 | 15.64 | 25.6798 | -10.0398 |
| avon-and-somerset | Other | 66 | 4.63 | 2.7368 | 1.8932 |
| avon-and-somerset | Undefined | 184 | 12.9 | 7.4699 | 5.4301 |
| avon-and-somerset | White | 912 | 63.96 | 50.941 | 13.019 |
| bedfordshire | Asian | 440 | 23.31 | 13.0641 | 10.2459 |
| bedfordshire | Black | 373 | 19.76 | 25.6798 | -5.9198 |
| bedfordshire | Mixed | 2 | 0.11 | 0.1084 | 0.0016 |
| bedfordshire | Other | 33 | 1.75 | 2.7368 | -0.9868 |
| bedfordshire | Undefined | 97 | 5.14 | 7.4699 | -2.3299 |
| bedfordshire | White | 943 | 49.95 | 50.941 | -0.991 |
| btp | Asian | 301 | 7.14 | 13.0641 | -5.9241 |
| btp | Black | 1274 | 30.23 | 25.6798 | 4.5502 |
| btp | Other | 71 | 1.68 | 2.7368 | -1.0568 |
| btp | Undefined | 48 | 1.14 | 7.4699 | -6.3299 |
| btp | White | 2521 | 59.81 | 50.941 | 8.869 |
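I'm using PostgreSQL v11.2.
Answer 0 (score: 1)
There are several ways to simplify the query. You could use a series of CTEs to precompute the results at each level of aggregation, but I think the most efficient and readable approach is to use window functions.
All of the intermediate counts can be computed in a subquery using COUNT(...) OVER(...) with different PARTITION BY clauses, like this: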
SELECT
force,
ethnicity,
COUNT(*) OVER(PARTITION BY force, ethnicity) AS cnt,
COUNT(*) OVER(PARTITION BY force) AS cnt_force,
COUNT(*) OVER(PARTITION BY ethnicity) AS cnt_ethnicity,
ROW_NUMBER() OVER(PARTITION BY force, ethnicity) AS rn
FROM crimes
The outer query can then compute the final results, keeping only the first record of each force / ethnicity tuple to avoid duplicates.
Query:
SELECT
force,
ethnicity,
cnt AS num_searches,
ROUND(cnt / cnt_force::decimal * 100, 4) AS percentage_of_force,
ROUND(cnt_ethnicity / 303565::decimal * 100, 4) AS national_average,
ROUND(cnt / cnt_force::decimal * 100, 4)
- ROUND(cnt_ethnicity / 303565::decimal * 100, 4) AS difference_from_average
FROM (
SELECT
force,
ethnicity,
COUNT(*) OVER(PARTITION BY force, ethnicity) AS cnt,
COUNT(*) OVER(PARTITION BY force) AS cnt_force,
COUNT(*) OVER(PARTITION BY ethnicity) AS cnt_ethnicity,
ROW_NUMBER() OVER(PARTITION BY force, ethnicity) AS rn
FROM crimes
) x
WHERE rn = 1
ORDER BY force, ethnicity;
| force | ethnicity | num_searches | percentage_of_force | national_average | difference_from_average |
| --------------- | --------- | ------------ | ------------------- | ---------------- | ----------------------- |
| metropolitan | Black | 6 | 46.1538 | 0.0020 | 46.1518 |
| metropolitan | Undefined | 1 | 7.6923 | 0.0003 | 7.6920 |
| metropolitan | White | 6 | 46.1538 | 0.0043 | 46.1495 |
| norfolk | White | 1 | 100.0000 | 0.0043 | 99.9957 |
| north-yorkshire | White | 2 | 100.0000 | 0.0043 | 99.9957 |
| northumbria | White | 1 | 100.0000 | 0.0043 | 99.9957 |
| west-yorkshire | White | 3 | 100.0000 | 0.0043 | 99.9957 |
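For comparison, the CTE approach mentioned at the start of this answer could look roughly like the sketch below. It assumes the same crimes(force, ethnicity) table as the question. The CTE names (per_pair, per_force, per_ethnicity, pct) are made up for illustration. Note one deliberate difference: it derives the national total from the table itself instead of using the hard-coded 303565, so on the small sample above national_average will not match the figures in the table.

```sql
-- Sketch only: assumes the crimes(force, ethnicity) table from the question.
WITH per_pair AS (
    -- One row per force / ethnicity combination
    SELECT force, ethnicity, COUNT(*) AS cnt
    FROM crimes
    GROUP BY force, ethnicity
),
per_force AS (
    -- Total searches per force
    SELECT force, SUM(cnt) AS cnt_force
    FROM per_pair
    GROUP BY force
),
per_ethnicity AS (
    -- Total searches per ethnicity across all forces
    SELECT ethnicity, SUM(cnt) AS cnt_ethnicity
    FROM per_pair
    GROUP BY ethnicity
),
pct AS (
    -- SUM() over bigint yields numeric in PostgreSQL, so no explicit cast is needed
    SELECT p.force,
           p.ethnicity,
           p.cnt AS num_searches,
           ROUND(p.cnt / f.cnt_force * 100, 4) AS percentage_of_force,
           ROUND(e.cnt_ethnicity / (SELECT SUM(cnt) FROM per_pair) * 100, 4) AS national_average
    FROM per_pair p
    JOIN per_force f USING (force)
    JOIN per_ethnicity e USING (ethnicity)
)
SELECT force,
       ethnicity,
       num_searches,
       percentage_of_force,
       national_average,
       percentage_of_force - national_average AS difference_from_average
FROM pct
ORDER BY force, ethnicity;
```

Because each percentage is named in the pct CTE, the final SELECT can subtract one from the other without repeating either expression. Rows where force or ethnicity is NULL would be dropped by the joins, which may or may not be what you want.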
Answer 1 (score: 0)
The trick is to use a subselect:
SELECT f(a, b), a, c
FROM (SELECT g(c, d) AS a,
h(c) AS b,
c, d
FROM x) AS q;
You get the idea.
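Applied to the question's table, that pattern might look like the following sketch. It assumes the crimes(force, ethnicity) table from the question. As a further assumption, it replaces the hard-coded 303565 with a total computed from the table. The inner query names percentage_of_force and national_average once, and the outer query reuses them.

```sql
-- Sketch only: assumes the crimes(force, ethnicity) table from the question.
SELECT force,
       ethnicity,
       num_searches,
       percentage_of_force,
       national_average,
       -- Reuse the aliases defined in the subselect instead of repeating the expressions
       percentage_of_force - national_average AS difference_from_average
FROM (
    SELECT force,
           ethnicity,
           COUNT(*) AS num_searches,
           -- Window functions run after GROUP BY, so SUM(COUNT(*)) OVER (...)
           -- adds up the per-group counts within each partition
           ROUND(COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY force) * 100, 4)
               AS percentage_of_force,
           ROUND(SUM(COUNT(*)) OVER (PARTITION BY ethnicity)
                 / SUM(COUNT(*)) OVER () * 100, 4)
               AS national_average
    FROM crimes
    GROUP BY force, ethnicity
) AS q
ORDER BY force, ethnicity;
```

A column alias cannot be referenced by another expression in the same SELECT list, which is why the subtraction has to live one level up.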