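Using PostgreSQL 9.0.
Say I have a table with the fields company, profession and year. I want to return a result that contains the unique companies and professions, but aggregates the years (into an array is fine) based on numeric sequence:
Example table: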
+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google | Programmer | 2000 |
| Google | Sales | 2000 |
| Google | Sales | 2001 |
| Google | Sales | 2002 |
| Google | Sales | 2004 |
| Mozilla | Sales | 2002 |
+-----------------------------+
I'm interested in a query that would output rows similar to the following:
+-----------------------------------------+
| company | profession | year |
+---------+------------+------------------+
| Google | Programmer | [2000] |
| Google | Sales | [2000,2001,2002] |
| Google | Sales | [2004] |
| Mozilla | Sales | [2002] |
+-----------------------------------------+
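The essential feature is that only consecutive years are grouped together.
Answer 0 (score: 19)
Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).
The first step is to flag the rows where the year is not consecutive with the previous one: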
select company,
       profession,
       year,
       case
          when row_number() over (partition by company, profession order by year) = 1 or
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification
This returns the following result:
 company | profession | year | group_cnt
---------+------------+------+-----------
 Google  | Programmer | 2000 |         1
 Google  | Sales      | 2000 |         1
 Google  | Sales      | 2001 |         0
 Google  | Sales      | 2002 |         0
 Google  | Sales      | 2004 |         1
 Mozilla | Sales      | 2002 |         1
Now, with a running sum over the group_cnt value, we can create a "group ID" for each group of consecutive years:
select company,
       profession,
       year,
       sum(group_cnt) over (order by company, profession, year) as group_nr
from (
    select company,
           profession,
           year,
           case
              when row_number() over (partition by company, profession order by year) = 1 or
                   year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
              else 0
           end as group_cnt
    from qualification
) t1
This returns the following result:
 company | profession | year | group_nr
---------+------------+------+----------
 Google  | Programmer | 2000 |        1
 Google  | Sales      | 2000 |        2
 Google  | Sales      | 2001 |        2
 Google  | Sales      | 2002 |        2
 Google  | Sales      | 2004 |        3
 Mozilla | Sales      | 2002 |        4
(6 rows)
As you can see, each "group" has its own group_nr, which we can finally use to aggregate by adding yet another derived table:
select company,
       profession,
       array_agg(year) as years
from (
    select company,
           profession,
           year,
           sum(group_cnt) over (order by company, profession, year) as group_nr
    from (
        select company,
               profession,
               year,
               case
                  when row_number() over (partition by company, profession order by year) = 1 or
                       year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
                  else 0
               end as group_cnt
        from qualification
    ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr
This returns the following result:
 company | profession |      years
---------+------------+------------------
 Google  | Programmer | {2000}
 Google  | Sales      | {2000,2001,2002}
 Google  | Sales      | {2004}
 Mozilla | Sales      | {2002}
(4 rows)
If I'm not mistaken, that is exactly what you wanted.
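For anyone who wants to trace the three steps outside the database, here is a minimal Python sketch of the same logic (flag the start of each run, take a running sum, then aggregate). It is only an illustration under my own assumptions: the function name group_by_running_sum and the in-memory rows list are hypothetical and not part of the answer's SQL.

```python
from itertools import groupby

# Sample rows from the question: (company, profession, year)
rows = [
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]


def group_by_running_sum(rows):
    """Hypothetical helper: flag run starts, number the runs, aggregate years."""
    numbered = []
    group_nr = 0
    prev = None
    for company, profession, year in sorted(rows):
        # Step 1 (group_cnt): 1 for the first row of a (company, profession)
        # partition or when the gap to the previous year exceeds 1, else 0.
        new_partition = prev is None or prev[:2] != (company, profession)
        group_cnt = 1 if new_partition or year - prev[2] > 1 else 0
        # Step 2 (group_nr): running sum of the flags over the ordered rows.
        group_nr += group_cnt
        numbered.append((company, profession, group_nr, year))
        prev = (company, profession, year)
    # Step 3: collect the years of each (company, profession, group_nr).
    result = []
    for (company, profession, _), grp in groupby(numbered, key=lambda r: r[:3]):
        result.append((company, profession, [r[3] for r in grp]))
    return result


for line in group_by_running_sum(rows):
    print(line)
```

The running sum only increases at the start of a run, so all rows of one run share the same number, which is what makes it usable as a grouping key.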
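Answer 1 (score: 11)
@a_horse_with_no_name's answer has a lot of value, both as a correct solution and, as I have already said in a comment, as good material for learning how to use different kinds of window functions in PostgreSQL.
And yet I cannot help feeling that the approach taken in that answer is a bit too much effort for a problem like this one. Basically, what you need is one additional grouping criterion before you go on to aggregate the years into arrays. You already have company and profession; now you only need something to distinguish years that belong to different sequences.
That is just what the above-mentioned answer provides, and that is precisely what I think can be done in a simpler way. Here's how: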
WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID
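Why this works: within a run of consecutive years, both the year and the row number increase by 1 from row to row, so their difference stays constant; after a gap the year jumps by more than 1 and the difference changes. The arithmetic can be checked with a small Python sketch. The helper name group_by_seq_id and the rows list are assumptions made for illustration, and the sketch assumes, as the query implicitly does, that a year never appears twice for the same company and profession.

```python
from collections import defaultdict

# Sample rows from the question: (company, profession, year)
rows = [
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]


def group_by_seq_id(rows):
    """Hypothetical helper: group years by (company, profession, year - row number)."""
    position = defaultdict(int)  # row counter per (company, profession)
    groups = {}                  # (company, profession, seq_id) -> list of years
    for company, profession, year in sorted(rows):
        position[(company, profession)] += 1
        # Equivalent of: year - ROW_NUMBER() OVER (PARTITION BY ... ORDER BY year)
        seq_id = year - position[(company, profession)]
        groups.setdefault((company, profession, seq_id), []).append(year)
    return groups


for key, years in group_by_seq_id(rows).items():
    print(key, years)
```

For Google / Sales the differences are 2000-1, 2001-2 and 2002-3 (all 1999), then 2004-4 = 2000, so 2004 lands in its own group. The seqID value itself has no meaning; it only needs to differ between runs within the same company and profession.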
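Answer 2 (score: 4)
The problem is rather unwieldy for plain SQL with aggregate / window functions. While looping is typically slower than set-based solutions in plain SQL, a procedural solution with plpgsql can make do with a single sequential scan over the table (the implicit cursor of a FOR loop) and should be substantially faster in this particular case:
Test table: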
CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
('Google', 'Programmer', 2000)
,('Google', 'Sales', 2000)
,('Google', 'Sales', 2001)
,('Google', 'Sales', 2002)
,('Google', 'Sales', 2004)
,('Mozilla', 'Sales', 2002);
Function:
CREATE OR REPLACE FUNCTION f_periods()
  RETURNS TABLE (company text, profession text, years int[]) AS
$func$
DECLARE
   r  tbl;  -- use table type as row variable
   r0 tbl;
BEGIN
   FOR r IN
      SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
   LOOP
      IF ( r.company,  r.profession,  r.year)
      <> (r0.company, r0.profession, r0.year + 1) THEN  -- not true for first row
         RETURN QUERY
         SELECT r0.company, r0.profession, years;  -- output row
         years := ARRAY[r.year];  -- start new array
      ELSE
         years := years || r.year;  -- add to array - year can be NULL, too
      END IF;
      r0 := r;  -- remember last row
   END LOOP;
   RETURN QUERY  -- output last iteration
   SELECT r0.company, r0.profession, years;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_periods();
Produces the requested result.
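The control flow of the loop may be easier to follow in a rough Python analogue: one pass over the sorted rows, emitting the collected years whenever the current row does not continue the previous one. This is a sketch under my own assumptions, not a translation of the function; the generator name periods and the rows list are hypothetical, and unlike the plpgsql version it explicitly skips the final output when there are no rows at all.

```python
# Sample rows from the question: (company, profession, year)
rows = [
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]


def periods(rows):
    """Hypothetical generator: single pass, yield a group when the run breaks."""
    prev = None   # last row seen (plays the role of r0)
    years = []    # years collected for the current run
    for row in sorted(rows):
        company, profession, year = row
        # The run continues only if company and profession match the previous
        # row and the year is exactly the previous year + 1.
        if prev is not None and row != (prev[0], prev[1], prev[2] + 1):
            yield prev[0], prev[1], years  # output the finished run
            years = [year]                 # start a new list
        else:
            years.append(year)             # extend the current run
        prev = row                         # remember last row
    if prev is not None:
        yield prev[0], prev[1], years      # output the last run


for line in periods(rows):
    print(line)
```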