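Generating a histogram of values grouped by a column

Time: 2018-07-17 14:55:16

Tags: postgresql histogram

In my review table, I have the following data for some items, scored on a scale from 0 to 100: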

+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
| 1         | 1       | 90    |
+-----------+---------+-------+
| 2         | 1       | 40    |
+-----------+---------+-------+
| 3         | 1       | 10    |
+-----------+---------+-------+
| 4         | 2       | 90    |
+-----------+---------+-------+
| 5         | 2       | 90    |
+-----------+---------+-------+
| 6         | 2       | 70    |
+-----------+---------+-------+
| 7         | 3       | 80    |
+-----------+---------+-------+
| 8         | 3       | 80    |
+-----------+---------+-------+
| 9         | 3       | 80    |
+-----------+---------+-------+
| 10        | 3       | 80    |
+-----------+---------+-------+
| 11        | 4       | 10    |
+-----------+---------+-------+
| 12        | 4       | 30    |
+-----------+---------+-------+
| 13        | 4       | 50    |
+-----------+---------+-------+
| 14        | 4       | 80    |
+-----------+---------+-------+

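I am trying to build a histogram of the score values with 5 bins. My goal is to generate one histogram per item. A histogram over the whole table can be created with width_bucket, and the query can be adapted to work per item: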

SELECT item_id, g.n as bucket, COUNT(m.score) as count 
FROM generate_series(1, 5) g(n) LEFT JOIN
     review as m
     ON width_bucket(score, 0, 100, 4) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;

However, the result looks like this:

+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
| 1       | 5      | 1     |
+---------+--------+-------+
| 1       | 3      | 1     |
+---------+--------+-------+
| 1       | 1      | 1     |
+---------+--------+-------+
| 2       | 5      | 2     |
+---------+--------+-------+
| 2       | 4      | 2     |
+---------+--------+-------+
| 3       | 4      | 4     |
+---------+--------+-------+
| 4       | 1      | 1     |
+---------+--------+-------+
| 4       | 2      | 1     |
+---------+--------+-------+
| 4       | 3      | 1     |
+---------+--------+-------+
| 4       | 4      | 1     |
+---------+--------+-------+

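That is, bins with no entries are not included. While I don't think this is a bad solution, I would prefer to get all buckets, with a count of 0 for those that have no entries. Even better would be this structure: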

+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
| 1       | 1        | 0        | 1        | 0        | 1        |
+---------+----------+----------+----------+----------+----------+
| 2       | 0        | 0        | 0        | 2        | 2        |
+---------+----------+----------+----------+----------+----------+
| 3       | 0        | 0        | 0        | 4        | 0        |
+---------+----------+----------+----------+----------+----------+
| 4       | 1        | 1        | 1        | 1        | 0        |
+---------+----------+----------+----------+----------+----------+

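I prefer this layout because it uses one row per item (instead of 5n rows), which makes it simpler to query and keeps memory consumption and data transfer costs to a minimum. My current approach is as follows: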

select item_id, 
(sum(case when score >= 0 and score <= 19 then 1 else 0 end)) as bucket_1,
(sum(case when score >= 20 and score <= 39 then 1 else 0 end)) as bucket_2,
(sum(case when score >= 40 and score <= 59 then 1 else 0 end)) as bucket_3,
(sum(case when score >= 60 and score <= 79 then 1 else 0 end)) as bucket_4,
(sum(case when score >= 80 and score <= 100 then 1 else 0 end)) as bucket_5
from review
group by item_id;

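Although this query meets my requirements, I would like to know whether there is a more elegant approach. That many CASE expressions are hard to read, and changing the bin boundaries may require updating every sum. I am also curious about potential performance problems with this query.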
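For the long-format result, the empty bins go missing because the GROUP BY on item_id happens after the LEFT JOIN: a bucket with no matching review has no item_id to be grouped under. One possible way around this, sketched below, is to cross join the distinct items with the bucket numbers first and only then left join the scores. The sketch runs the SQL in SQLite through Python's sqlite3 module purely so that it is self-contained; the names items, scored and n_reviews are illustrative, and in PostgreSQL generate_series(1, 5) could presumably replace the VALUES list and width_bucket the CASE expression.

```python
import sqlite3

# Sample rows from the question: (review_id, item_id, score).
REVIEWS = [
    (1, 1, 90), (2, 1, 40), (3, 1, 10),
    (4, 2, 90), (5, 2, 90), (6, 2, 70),
    (7, 3, 80), (8, 3, 80), (9, 3, 80), (10, 3, 80),
    (11, 4, 10), (12, 4, 30), (13, 4, 50), (14, 4, 80),
]

# Assumption: buckets follow the CASE query (0-19, 20-39, ..., 80-100).
QUERY = """
WITH buckets(n) AS (VALUES (1), (2), (3), (4), (5)),
     items AS (SELECT DISTINCT item_id FROM review),
     scored AS (
         SELECT item_id,
                CASE WHEN score >= 100 THEN 5 ELSE score / 20 + 1 END AS bucket
         FROM review
     )
SELECT i.item_id, b.n AS bucket, COUNT(s.item_id) AS n_reviews
FROM items AS i
CROSS JOIN buckets AS b            -- every (item, bucket) pair exists up front
LEFT JOIN scored AS s
       ON s.item_id = i.item_id AND s.bucket = b.n
GROUP BY i.item_id, b.n
ORDER BY i.item_id, b.n
"""


def zero_filled_histogram():
    """Return (item_id, bucket, n_reviews) rows, including empty buckets."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE review (review_id INTEGER, item_id INTEGER, score INTEGER)"
    )
    conn.executemany("INSERT INTO review VALUES (?, ?, ?)", REVIEWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in zero_filled_histogram():
        print(row)
```

With the sample data this returns five rows per item; for item 1 the counts for buckets 1 through 5 are 1, 0, 1, 0, 1.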

2 answers:

Answer 0 (score: 1)

The second query can be rewritten using ranges, which makes the query easier to write and edit:

with buckets (b1, b2, b3, b4, b5) as (
  values ( 
     int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80),
     int4range(80, 101) -- upper bound is exclusive, so 101 keeps a score of 100 in bucket_5
  )
)
select item_id, 
       count(*) filter (where b1 @> score) as bucket_1,
       count(*) filter (where b2 @> score) as bucket_2,
       count(*) filter (where b3 @> score) as bucket_3,
       count(*) filter (where b4 @> score) as bucket_4,
       count(*) filter (where b5 @> score) as bucket_5
from review 
  cross join buckets
group by item_id
order by item_id;

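A range constructed with int4range(0, 20) includes the lower bound but excludes the upper bound.

The CTE (common table expression) named buckets creates only a single row, so the cross join does not change the number of rows coming from the review table.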
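To make the half-open behaviour concrete, here is a small Python model of the containment test. It is only a sketch of the semantics, not PostgreSQL code: in_range and bucket_counts are hypothetical helper names, and the bounds are passed as plain tuples.

```python
def in_range(value, lower, upper):
    """Model of `int4range(lower, upper) @> value`:
    the lower bound is inclusive, the upper bound is exclusive."""
    return lower <= value < upper


def bucket_counts(scores, buckets):
    """Count how many scores fall into each (lower, upper) bucket."""
    return [sum(in_range(s, lo, hi) for s in scores) for lo, hi in buckets]


# Bounds ending at 100 leave a score of exactly 100 uncounted ...
ENDING_AT_100 = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
# ... while an upper bound of 101 on the last bucket includes it.
ENDING_AT_101 = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 101)]

if __name__ == "__main__":
    print(bucket_counts([90, 40, 10], ENDING_AT_100))  # item 1 from the question
    print(bucket_counts([100], ENDING_AT_100))
    print(bucket_counts([100], ENDING_AT_101))
```

For item 1 (scores 90, 40, 10) the model gives [1, 0, 1, 0, 1]. A score of 100 is counted nowhere when the last range ends at 100, and lands in the fifth bucket when it ends at 101.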

Answer 1 (score: 1)

I found this post useful:

CREATE FUNCTION temp_histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
  WITH
  source AS (
    SELECT * FROM %s
  ),
  min_max AS (
    SELECT min(%s) AS min, max(%s) AS max FROM source
  ),
  temp_histogram AS (
    SELECT
      width_bucket(%s, min_max.min, min_max.max, 100) AS bucket,
      numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
      count(%s) AS freq
    FROM source, min_max
    WHERE %s IS NOT NULL
    GROUP BY bucket
    ORDER BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
  FROM temp_histogram',
  table_name_or_subquery,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name
  );
END
$func$ LANGUAGE plpgsql;

Adjust the number of buckets (100 in the script above) to suit your needs.

Call it like this:

SELECT * FROM temp_histogram($table_name_or_subquery, $column_name);

Example:

SELECT * FROM temp_histogram('transactions_tbl', 'amount_colm');
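One detail worth knowing when reading the output: width_bucket treats the upper bound as exclusive and returns count + 1 for values at or above it, so the row holding the column's maximum ends up in bucket 101 rather than 100. The following Python function is a rough model of that behaviour for ascending bounds only; width_bucket_model is a hypothetical name, and floating-point edge cases may differ from PostgreSQL's own arithmetic.

```python
import math


def width_bucket_model(operand, low, high, count):
    """Approximate PostgreSQL's width_bucket(operand, low, high, count)
    for the case low < high (equal-width buckets over [low, high))."""
    if count <= 0:
        raise ValueError("count must be greater than zero")
    if low >= high:
        raise ValueError("this model only covers low < high")
    if operand < low:
        return 0              # underflow bucket
    if operand >= high:
        return count + 1      # overflow bucket, includes operand == high
    return math.floor((operand - low) * count / (high - low)) + 1


if __name__ == "__main__":
    # Scores from the question, 5 buckets over [0, 100).
    for score in (10, 40, 90, 100):
        print(score, width_bucket_model(score, 0, 100, 5))
    # The function above buckets over [min, max): the maximum overflows.
    print(width_bucket_model(90, 10, 90, 100))
```

If a single top bucket is preferred, one option is to clamp the result to the bucket count (for example with least(..., 100) in the SQL), though that is a design choice rather than something the original function does.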