Question

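I have two tables, parcel and structure, with a one-to-many relationship between them: structure.parcel_id points to parcel.id.

I want to select all of the single structures, that is, structures that are the only one on their parcel. My current solution works, but it is very grotesque: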

SELECT 
max(column_1),
max(column_2),
max(column_3),
...
(twenty+ columns)

FROM structure
GROUP BY parcel_id
HAVING count(structure.id) = 1;

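Because structure.id is non-nullable and the HAVING clause above is in place, every group contains exactly one row by definition. Unfortunately, Postgres does not realize this, so if I write: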

SELECT *    
FROM structure
GROUP BY parcel_id
HAVING count(structure.id) = 1;

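then I get the expected error about needing an aggregate function for the columns. I work around this with the arbitrary max() function, but that is confusing to anyone else trying to understand the code, and it forces me to list every column explicitly, which means I have to go back and edit this code whenever a column is added. (Unfortunately, that happens fairly often in my environment.)

I have this alternative solution, which solves most of my problems: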

SELECT * FROM STRUCTURE
WHERE id IN (
    SELECT
        max(id) as id
    FROM structure
    GROUP BY structure.parcel_id
    HAVING count(structure.id) = 1
    );

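But this obviously adds unnecessary slowness to my query, which I would like to avoid given how often the query runs and how large the tables are.

This question is very similar to what I am asking, but it grabs the first row of every group, rather than the first (and only) row of singleton groups.

Is there an elegant way to solve this problem?

Sample data, as requested:

The structure table: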

id | parcel_id | column_1 | column_2 | ...
------------------------------------------
1  |   536     |   ...    | ....     | ...
2  |   536     |   ...    | ....     | ...
3  |   537     |   ...    | ....     | ...
4  |   538     |   ...    | ....     | ...
5  |   538     |   ...    | ....     | ...
6  |   539     |   ...    | ....     | ...
7  |   540     |   ...    | ....     | ...
8  |   541     |   ...    | ....     | ...
9  |   541     |   ...    | ....     | ...

Desired result:

id | parcel_id | column_1 | column_2 | ...
------------------------------------------
3  |   537     |   ...    | ....     | ...
6  |   539     |   ...    | ....     | ...
7  |   540     |   ...    | ....     | ...

Note that 537, 539, and 540 are the only parcel_id values that do not repeat.

Both tables have ~1.5 million rows and ~25 columns.
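For anyone who wants to reproduce this, here is a minimal sketch of the sample data above. Only id and parcel_id are modelled; the integer types and the foreign key are assumptions, and the other ~25 columns are left out.

```sql
-- Assumed minimal schema: only the two columns shown in the sample data.
CREATE TABLE parcel (
    id integer PRIMARY KEY
);

CREATE TABLE structure (
    id        integer PRIMARY KEY,                      -- non-nullable, as stated above
    parcel_id integer NOT NULL REFERENCES parcel (id)   -- many structures per parcel
);

INSERT INTO parcel (id)
VALUES (536), (537), (538), (539), (540), (541);

INSERT INTO structure (id, parcel_id)
VALUES (1, 536), (2, 536), (3, 537), (4, 538), (5, 538),
       (6, 539), (7, 540), (8, 541), (9, 541);
```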

Answer 1

How about using a window function?

SELECT s.*    
FROM (SELECT s.*, COUNT(*) OVER (PARTITION BY parcel_id) as cnt
      FROM structure s
     ) s
WHERE cnt = 1;

However, a more efficient approach might be:

select s.*
from structure s
where not exists (select 1
                  from structure s2
                  where s2.parcel_id = s.parcel_id and s2.id <> s.id
                 );

In particular, this can take advantage of an index on structure(parcel_id, id).
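A minimal sketch of that index, in case it does not exist yet. The index name is hypothetical; any name works.

```sql
-- Hypothetical index name. With (parcel_id, id) indexed, the inner probe of
-- the NOT EXISTS subquery can look up sibling rows by parcel_id and compare
-- their id from the index entries.
CREATE INDEX structure_parcel_id_id_idx
    ON structure (parcel_id, id);
```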

Answer 2

This should be considerably faster:

SELECT s.*
FROM  (
   SELECT parcel_id
   FROM   structure
   GROUP  BY 1
   HAVING count(*) = 1
   ) s1
JOIN structure s USING (parcel_id);

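All you need is an index on (parcel_id).

Since the query is restricted to unique parcel_id values, there is no need to include id in the subquery. So we can get an index-only scan from a plain index on (parcel_id), and use the same index for the join.
The join should be a bit faster than IN with a big subselect. (Though they mostly result in the same query plan in modern Postgres.)
count(*) is a bit faster than count(<expression>), since only the existence of a row has to be established.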
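To check whether the planner really picks an index-only scan, something along these lines could be run. This is only a sketch: the index name is hypothetical, and an index-only scan is only likely when the table's visibility map is reasonably current, which is why VACUUM comes first.

```sql
-- Hypothetical index name; a plain single-column index as described above.
CREATE INDEX structure_parcel_id_idx ON structure (parcel_id);

-- Refresh the visibility map and planner statistics.
VACUUM ANALYZE structure;

-- Inspect the actual plan; look for "Index Only Scan" in the subquery.
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.*
FROM  (
   SELECT parcel_id
   FROM   structure
   GROUP  BY 1
   HAVING count(*) = 1
   ) s1
JOIN structure s USING (parcel_id);
```

Plans depend on table size, statistics, and Postgres version, so the output on a small test table may differ from what a 1.5-million-row table produces.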

Asides:

@Gordon's 2nd query with the NOT EXISTS anti-semi-join should be fast, too. You just need a multicolumn index on (parcel_id, id).

The question you linked to is for SQL Server. Here is a more relevant related question for Postgres:

Select first row in each GROUP BY group?

Selecting the single rows, and only the single rows, from a GROUP BY
