In the review table, I have the following data for some items, collected with a rating system from 0 to 100:
+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
|         1 |       1 |    90 |
|         2 |       1 |    40 |
|         3 |       1 |    10 |
|         4 |       2 |    90 |
|         5 |       2 |    90 |
|         6 |       2 |    70 |
|         7 |       3 |    80 |
|         8 |       3 |    80 |
|         9 |       3 |    80 |
|        10 |       3 |    80 |
|        11 |       4 |    10 |
|        12 |       4 |    30 |
|        13 |       4 |    50 |
|        14 |       4 |    80 |
+-----------+---------+-------+
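I am trying to create a histogram of the score values with 5 bins. My goal is to generate one histogram per item. To create a histogram over the whole table, width_bucket can be used. It can also be adapted to work per item: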
SELECT item_id, g.n as bucket, COUNT(m.score) as count
FROM generate_series(1, 5) g(n) LEFT JOIN
review as m
ON width_bucket(score, 0, 100, 4) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;
However, the result looks like this:
+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
|       1 |      5 |     1 |
|       1 |      3 |     1 |
|       1 |      1 |     1 |
|       2 |      5 |     2 |
|       2 |      4 |     2 |
|       3 |      4 |     4 |
|       4 |      1 |     1 |
|       4 |      2 |     1 |
|       4 |      3 |     1 |
|       4 |      4 |     1 |
+---------+--------+-------+
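That is, bins with no entries are not included. While I do not think this is a bad solution, I would rather have all buckets, with the empty ones reported as 0. Even better would be this structure: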
+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
|       1 |        1 |        0 |        1 |        0 |        1 |
|       2 |        0 |        0 |        0 |        2 |        2 |
|       3 |        0 |        0 |        0 |        4 |        0 |
|       4 |        1 |        1 |        1 |        1 |        0 |
+---------+----------+----------+----------+----------+----------+
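I prefer this solution because it uses one row per item (n rows for n items instead of 5n), which makes it simpler to query and minimizes memory consumption and data transfer costs. My current approach is as follows: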
select item_id,
(sum(case when score >= 0 and score <= 19 then 1 else 0 end)) as bucket_1,
(sum(case when score >= 20 and score <= 39 then 1 else 0 end)) as bucket_2,
(sum(case when score >= 40 and score <= 59 then 1 else 0 end)) as bucket_3,
(sum(case when score >= 60 and score <= 79 then 1 else 0 end)) as bucket_4,
(sum(case when score >= 80 and score <= 100 then 1 else 0 end)) as bucket_5
from review
group by item_id;
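Although this query does what I need, I would like to know whether there is a more elegant approach. That many case expressions are hard to read, and changing the bin boundaries may require updating every sum. I am also curious about potential performance problems with this query.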
Answer 0: (score: 1)
The second query can be rewritten to use range types, which makes the query easier to write and to edit:
with buckets (b1, b2, b3, b4, b5) as (
  -- int4range(lo, hi) matches lo <= score < hi; the last range ends at 101
  -- so that a score of exactly 100 is still counted in bucket_5
  values (
    int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80), int4range(80, 101)
  )
)
select item_id,
       count(*) filter (where b1 @> score) as bucket_1,
       count(*) filter (where b2 @> score) as bucket_2,
       count(*) filter (where b3 @> score) as bucket_3,
       count(*) filter (where b4 @> score) as bucket_4,
       count(*) filter (where b5 @> score) as bucket_5
from review
  cross join buckets
group by item_id
order by item_id;
A range constructed with int4range(0, 20) includes the lower bound but excludes the upper bound.
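To make the bound semantics concrete, here is a small check that can be run on its own in psql. It does not touch the review table, and the column aliases are made up for illustration. The third expression passes the optional bounds argument '[]', which is one way to make the upper bound inclusive.

```sql
select int4range(0, 20) @> 0           as contains_lower,  -- true: the lower bound is inclusive
       int4range(0, 20) @> 20          as contains_upper,  -- false: the upper bound is exclusive
       int4range(80, 100, '[]') @> 100 as contains_100;    -- true: '[]' makes both bounds inclusive
```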
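The common table expression (CTE) named buckets produces only a single row, so the cross join does not change the number of rows coming from the review table.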
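If the long format from the question (one row per item and bucket) is wanted after all, the same range operator can be combined with generate_series so that empty buckets appear with a count of 0. The following is only a sketch, assuming review(item_id, score) holds integer scores from 0 to 100; the CTE names items and bucket_ranges and the column score_range are hypothetical names chosen for this example.

```sql
with items as (
    -- every item that has at least one review
    select distinct item_id from review
),
bucket_ranges as (
    -- five ranges of width 20; the last one ends at 101 so that 100 is included
    select n as bucket,
           int4range((n - 1) * 20, case when n = 5 then 101 else n * 20 end) as score_range
    from generate_series(1, 5) as g(n)
)
select i.item_id,
       b.bucket,
       count(r.score) as count          -- counts only matching reviews, so empty buckets give 0
from items i
  cross join bucket_ranges b            -- all (item, bucket) pairs, whether or not reviews exist
  left join review r
         on r.item_id = i.item_id
        and b.score_range @> r.score
group by i.item_id, b.bucket
order by i.item_id, b.bucket;
```

Building all item/bucket pairs first and only then left-joining the reviews is what keeps the empty buckets: the join can add matches but cannot remove a pair.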
Answer 1: (score: 1)
I found this post useful:
CREATE FUNCTION temp_histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
    RETURN QUERY EXECUTE format('
        WITH
        source AS (
            SELECT * FROM %s
        ),
        min_max AS (
            SELECT min(%s) AS min, max(%s) AS max FROM source
        ),
        temp_histogram AS (
            SELECT
                width_bucket(%s, min_max.min, min_max.max, 100) AS bucket,
                numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
                count(%s) AS freq
            FROM source, min_max
            WHERE %s IS NOT NULL
            GROUP BY bucket
            ORDER BY bucket
        )
        SELECT
            bucket,
            "range",
            freq::bigint,
            repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
        FROM temp_histogram',
        table_name_or_subquery,
        column_name,
        column_name,
        column_name,
        column_name,
        column_name,
        column_name,
        column_name
    );
END
$func$ LANGUAGE plpgsql;
Adjust the number of buckets (100 in the script above) to your needs.
Call it like this:
SELECT * FROM temp_histogram($table_name_or_subquery, $column_name);
Example:
SELECT * FROM temp_histogram('transactions_tbl', 'amount_colm');
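Applied to the question's data, the function could be called once per item by passing a subquery instead of a table name. This is a sketch under the assumption that the function was created as temp_histogram, as in the definition above; the alias s is arbitrary. Because the first argument is spliced into "SELECT * FROM %s" as plain text, a subquery has to be parenthesised and given an alias, and the argument should never come from untrusted input.

```sql
-- Histogram of the scores of a single item (item_id = 1)
SELECT *
FROM temp_histogram('(SELECT score FROM review WHERE item_id = 1) AS s', 'score');
```

One caveat: the function takes its bounds from min and max of the column, and width_bucket raises an error when the lower and upper bound are equal, so an item whose scores are all identical (item 3 in the question) would fail with this call. For a fixed 0 to 100 scale, hard-coded bounds as in the first answer are probably the simpler choice.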