Given the following table, table1:
+--------+-------+-------+------------+-------+
| flight | orig | dest | passenger | bags |
+--------+-------+-------+------------+-------+
| 1111 | sfo | chi | david | 3 |
| 1112 | sfo | dal | david | 7 |
| 1112 | sfo | dal | kim | 10 |
| 1113 | lax | san | ameera | 5 |
| 1114 | lax | lfr | tim | 6 |
| 1114 | lax | lfr | jake | 8 |
+--------+-------+-------+------------+-------+
I aggregate the table by orig, as shown below:
select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from table1
group by orig
For each group (orig), I also need to add the passenger with the longest name, measured by length(passenger). How can I do that?
Expected output:
+------+-------------+-----------+---------------+-------------------+
| orig | flight_cnt | pass_cnt | bags_cnt_med | pass_max_len_name |
+------+-------------+-----------+---------------+-------------------+
| sfo | 3 | 2 | 7 | david |
| lax | 3 | 3 | 6 | ameera |
+------+-------------+-----------+---------------+-------------------+
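Answer 0 (score: 5)
You can use DISTINCT ON to conveniently retrieve the passenger with the longest name per group.
But I see no way to combine that (or any other simple method) with your original query in a single SELECT. I suggest joining two separate subqueries: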
SELECT *
FROM ( -- your original query
SELECT orig
, count(*) AS flight_cnt
, count(distinct passenger) AS pass_cnt
, percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
FROM table1
GROUP BY orig
) org_query
JOIN ( -- my addition
SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
FROM table1
ORDER BY orig, length(passenger) DESC NULLS LAST
) pas USING (orig);
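USING in the join clause conveniently outputs only one instance of orig, so you can simply use SELECT * in the outer SELECT.
If passenger can be NULL, it is important to add NULLS LAST, as the query above does. Otherwise NULL sorts first in descending order.
If several passenger names in the same group share the same maximum length, you get an arbitrary pick, unless you add more expressions to ORDER BY as a tiebreaker.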
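As a small illustration of the tiebreaker point, here is the DISTINCT ON subquery with one more ORDER BY expression. Sorting ties alphabetically by passenger is only an assumption made for this sketch; any deterministic expression would do.

```sql
-- Longest passenger name per orig; ties on length are broken
-- alphabetically by name (an arbitrary but deterministic choice).
SELECT DISTINCT ON (orig)
       orig, passenger AS pass_max_len_name
FROM   table1
ORDER  BY orig, length(passenger) DESC NULLS LAST, passenger;
```

The leading ORDER BY expressions must match the DISTINCT ON expressions; everything after them only decides which row wins within each group.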
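Typically, a single scan is superior, especially with sequential scans.
The above query uses two scans (possibly index or index-only scans). But the second scan is comparatively cheap, unless the table is too large to fit (mostly) in cache. Lukas suggested an alternative query with only a single SELECT,
adding: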
, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] -- I'd add NULLS LAST
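The idea is smart, but the last time I tested, array_agg with ORDER BY did not perform well. (The overhead of the per-group ORDER BY is substantial, and array handling is expensive, too.)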
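For reference, the single-SELECT variant would presumably look like this when the array_agg expression is added to the original query. This is a sketch assembled from the pieces above, with NULLS LAST included as suggested in the comment.

```sql
-- Single scan: sort the names per group by length (longest first)
-- inside the aggregate and take the first array element.
SELECT orig
     , count(*)                  AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
     , (array_agg(passenger ORDER BY length(passenger) DESC NULLS LAST))[1]
                                 AS pass_max_len_name
FROM   table1
GROUP  BY orig;
```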
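The same approach can be cheaper with a custom aggregate function first(), as described in the Postgres Wiki. Or faster still with a version written in C, available on PGXN. That eliminates the extra cost of array handling, but we still need the per-group ORDER BY. It may be faster with only a few groups. You would then add: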
, first(passenger ORDER BY length(passenger) DESC NULLS LAST)
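A minimal sketch of such an aggregate, modeled on the pure-SQL approach of the Postgres Wiki. The function name first_agg and the exact options are assumptions here; check the Wiki page for the current recommended definition before using it.

```sql
-- Transition function: always returns the state it already holds.
-- Because it is STRICT and the aggregate has no initial value, Postgres
-- uses the first non-null input as the state and keeps it from then on.
CREATE FUNCTION first_agg(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT AS
'SELECT $1';

CREATE AGGREGATE first(anyelement) (
  SFUNC = first_agg
, STYPE = anyelement
);

-- Usage inside the original aggregation query:
SELECT orig
     , count(*)                  AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
     , first(passenger ORDER BY length(passenger) DESC NULLS LAST)
                                 AS pass_max_len_name
FROM   table1
GROUP  BY orig;
```

Since the transition function is STRICT, NULL names are skipped anyway; NULLS LAST is kept only for consistency with the other variants.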
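Gordon and Lukas also mention the window function first_value(). Window functions are applied after aggregate functions. To use it in the same SELECT, we would first have to aggregate passenger somehow: a catch-22. Gordon solves this with a subquery, which is another candidate for good performance with standard Postgres.
first() does the same without a subquery and should be simpler and a bit faster. But for most cases with few rows per group it is still unlikely to be faster than a separate DISTINCT ON. For many rows per group, a recursive CTE technique is typically faster. There are faster techniques still if you have a separate table holding all relevant, unique orig values.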
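The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance, you have to test with your own setup. The above query should be among the fastest.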
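One possible shape of the "separate table" technique mentioned above is a LATERAL subquery that looks up one row per orig. The table origins is hypothetical (one row per unique orig value), and this is a sketch of the general idea, not necessarily the exact technique the answer had in mind.

```sql
-- Assumes a hypothetical table origins(orig) with one row per unique orig.
-- For each origin, fetch only the single passenger with the longest name.
SELECT o.orig, p.pass_max_len_name
FROM   origins o
LEFT   JOIN LATERAL (
   SELECT t.passenger AS pass_max_len_name
   FROM   table1 t
   WHERE  t.orig = o.orig
   ORDER  BY length(t.passenger) DESC NULLS LAST
   LIMIT  1
   ) p ON true;
```

The result could then be joined to the aggregation query with USING (orig), just like the DISTINCT ON subquery.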
Answer 1 (score: 2)
One method uses the window function first_value(). Unfortunately, it is not available as an aggregate function:
select orig,
count(*) flight_cnt,
count(distinct passenger) as pass_cnt,
percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med,
max(longest_name) as longest_name
from (select t1.*,
first_value(passenger) over (partition by orig order by length(passenger) desc) as longest_name
from table1 t1
) t1
group by orig;
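Answer 2 (score: 1)
You are looking for something like Oracle's KEEP FIRST/LAST, which returns a value (the passenger name) according to an aggregate (the name length). As far as I know, PostgreSQL has no such feature.
One way around this is a trick: combine the length and the name, take the maximum, then extract the name again: '0005david' > '0003kim', and so on.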
select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
-- FM suppresses the leading blank that to_char() reserves for the sign
, substr(max(to_char(char_length(passenger), 'FM0000') || passenger), 5) as name
from table1
group by orig
order by orig;
Answer 3 (score: 0)
t=# with p as (select distinct orig,passenger,length(trim(passenger)),max(length(trim(passenger))) over (partition by orig) from s127)
, o as ( select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from s127
group by orig)
select distinct o.*,p.passenger from o join p on p.orig = o.orig where max=length;
orig | flight_cnt | pass_cnt | bag_cnt_med | passenger
---------+------------+----------+-------------+--------------
lax | 3 | 3 | 6 | ameera
sfo | 3 | 2 | 7 | david
(2 rows)
To populate the test table:
t=# create table s127(flight int,orig text,dest text, passenger text, bags int);
CREATE TABLE
Time: 52.678 ms
t=# copy s127 from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1111 | sfo | chi | david | 3
>> 1112 | sfo | dal | david | 7
>> 1112 | sfo | dal | kim | 10
>> 1113 | lax | san | ameera | 5
>> 1114 | lax | lfr | tim | 6
>> 1114 | lax | lfr | jake | 8
>> \.
COPY 6