Suppose I have the following data in the one-to-many tables city and person, respectively:
SELECT city.*, person.* FROM city, person WHERE city.city_id = person.person_city_id;
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
| 1 | chicago | 1 | charles | 1 |
| 1 | chicago | 2 | celia | 1 |
| 1 | chicago | 3 | curtis | 1 |
| 1 | chicago | 4 | chauncey | 1 |
| 2 | new york | 5 | nathan | 2 |
| 3 | los angeles | 6 | luke | 3 |
| 3 | los angeles | 7 | louise | 3 |
| 3 | los angeles | 8 | lucy | 3 |
| 3 | los angeles | 9 | larry | 3 |
+---------+-------------+-----------+-------------+----------------+
9 rows in set (0.00 sec)
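To reproduce the result set above, a minimal setup might look like the following. Only the column names and row values come from the output shown; the column types, keys, and the foreign key are assumptions.

```sql
-- Assumed schema: types and constraints are guesses, not taken from the post.
CREATE TABLE city (
    city_id   INT PRIMARY KEY,
    city_name VARCHAR(50) NOT NULL
);

CREATE TABLE person (
    person_id      INT PRIMARY KEY,
    person_name    VARCHAR(50) NOT NULL,
    person_city_id INT NOT NULL,
    FOREIGN KEY (person_city_id) REFERENCES city (city_id)
);

-- Row values copied from the result set above.
INSERT INTO city (city_id, city_name) VALUES
    (1, 'chicago'),
    (2, 'new york'),
    (3, 'los angeles');

INSERT INTO person (person_id, person_name, person_city_id) VALUES
    (1, 'charles',  1),
    (2, 'celia',    1),
    (3, 'curtis',   1),
    (4, 'chauncey', 1),
    (5, 'nathan',   2),
    (6, 'luke',     3),
    (7, 'louise',   3),
    (8, 'lucy',     3),
    (9, 'larry',    3);
```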
I want to select a single person record for each unique city, using some specific logic. For example:
SELECT city.*, person.* FROM city, person WHERE city.city_id = person.person_city_id
GROUP BY city_id ORDER BY person_name DESC
;
The intent here is that, within each city, I want the person whose name is lexicographically greatest, e.g.:
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
| 2 | new york | 5 | nathan | 2 |
| 3 | los angeles | 6 | luke | 3 |
| 1 | chicago | 3 | curtis | 1 |
+---------+-------------+-----------+-------------+----------------+
However, the actual output I get is:
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
| 2 | new york | 5 | nathan | 2 |
| 3 | los angeles | 6 | luke | 3 |
| 1 | chicago | 1 | charles | 1 |
+---------+-------------+-----------+-------------+----------------+
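As far as I can tell, the reason for this discrepancy is that MySQL performs the GROUP BY first and the ORDER BY afterwards. That is unfortunate for me, because I want the GROUP BY to apply my selection logic when it decides which record to keep.
I can work around this with some nested SELECT statements: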
SELECT c.*, p.* FROM city c,
( SELECT p_inner.* FROM
( SELECT * FROM person ORDER BY person_city_id, person_name DESC ) p_inner
GROUP BY person_city_id ) p
WHERE c.city_id = p.person_city_id;
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
| 1 | chicago | 3 | curtis | 1 |
| 2 | new york | 5 | nathan | 2 |
| 3 | los angeles | 6 | luke | 3 |
+---------+-------------+-----------+-------------+----------------+
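This seems terribly inefficient once the person table grows arbitrarily large. I assume the inner SELECT statements are unaware of the outermost WHERE filter. Is that true?
What is the best way to effectively perform an ORDER BY before the GROUP BY?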
Answer 0 (score: 1)
The usual way to do this (in MySQL) is to join the table to itself.
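First, get the greatest person_name per city (i.e. per person_city_id in the person table):
SELECT p.*
FROM person p
LEFT JOIN person p2
ON p.person_city_id = p2.person_city_id
AND p.person_name < p2.person_name
WHERE p2.person_name IS NULL
This joins person to itself within each person_city_id (your GROUP BY variable) and pairs the rows up so that p2's person_name is greater than p's person_name.
Because it is a left join, if there is a p.person_name for which no greater p2.person_name exists (within the same city), then p2.person_name will be NULL. Those rows are exactly the "greatest" person_name per city.
To attach your other information (from city), just do one more join, matching city.city_id against p.person_city_id.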
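One way that final join could be written is sketched below. It is an illustration built from the self-join above and the column names in the question, not necessarily the answerer's exact query.

```sql
-- Sketch: greatest person_name per city, with the city columns attached.
SELECT c.*, p.*
FROM person p
LEFT JOIN person p2
       ON p.person_city_id = p2.person_city_id
      AND p.person_name < p2.person_name   -- p2 is any "greater" name in the same city
JOIN city c
       ON c.city_id = p.person_city_id
WHERE p2.person_name IS NULL;              -- keep rows that have no greater name
```

With the sample data from the question this keeps curtis for chicago, nathan for new york, and luke for los angeles. Row order is unspecified unless an ORDER BY is added.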
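Answer 1 (score: 0)
Your "solution" is not valid SQL, but it works in MySQL. You cannot be sure, however, that it will not break with a future change in the query optimizer's code. It could be slightly improved by using only one level of nesting (still not valid SQL):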
--- Option 1 ---
SELECT
c.*
, p.*
FROM
city AS c
JOIN
( SELECT *
FROM person
ORDER BY person_city_id
, person_name DESC
) AS p
ON c.city_id = p.person_city_id
GROUP BY p.person_city_id
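Another way (valid SQL syntax, and it works in other DBMSs too) is to build a subquery that selects the alphabetically last name for every city and then join: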
--- Option 2 ---
SELECT
c.*
, p.*
FROM
city AS c
JOIN
( SELECT person_city_id
, MAX(person_name) AS person_name
FROM person
GROUP BY person_city_id
) AS pmax
ON c.city_id = pmax.person_city_id
JOIN
person AS p
ON p.person_city_id = pmax.person_city_id
AND p.person_name = pmax.person_name
Another way is a self-join (of table person) using the < trick that @mathematical_coffee describes.
--- Option 3 ---
see @mathematical-coffee's answer
Yet another way is to join city and person using a LIMIT 1 subquery:
--- Option 4 ---
SELECT
c.*
, p.*
FROM
city AS c
JOIN
person AS p
ON
p.person_id =
( SELECT person_id
FROM person AS pm
WHERE pm.person_city_id = c.city_id
ORDER BY person_name DESC
LIMIT 1
)
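This runs a subquery (on table person) for every city, and it will be efficient if you have a (person_city_id, person_name) index with the InnoDB engine, or a (person_city_id, person_name, person_id) index with the MyISAM engine.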
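As a sketch, the two indexes could be created as follows. The index names are hypothetical, and the InnoDB variant assumes person_id is the table's primary key.

```sql
-- Index names are hypothetical; use the statement that matches your engine.

-- InnoDB: secondary indexes implicitly carry the primary key, so
-- (person_city_id, person_name) also covers person_id if it is the primary key.
CREATE INDEX idx_person_city_name
    ON person (person_city_id, person_name);

-- MyISAM: list person_id explicitly so the LIMIT 1 subquery can be
-- answered from the index alone.
CREATE INDEX idx_person_city_name_id
    ON person (person_city_id, person_name, person_id);
```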
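There is one major difference between these options:
Options 2 and 3 return all tied results (if two or more persons in the same city share the alphabetically last name, both or all of them are shown).
Options 1 and 4 return one result per city, even when there are ties. You can choose which one is returned by altering the ORDER BY clause.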
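For example, Option 4 can be given a deterministic tie-break by adding a second sort key. The choice of person_id as the tie-breaker (lowest id wins) is an assumption made for illustration.

```sql
SELECT c.*, p.*
FROM city AS c
JOIN person AS p
  ON p.person_id =
     ( SELECT pm.person_id
       FROM person AS pm
       WHERE pm.person_city_id = c.city_id
       ORDER BY pm.person_name DESC,   -- greatest name first
                pm.person_id ASC       -- assumed tie-break: lowest id wins
       LIMIT 1
     );
```

Switching the second key to pm.person_id DESC would instead pick the tied person with the highest id.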
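Which option is more efficient also depends on the distribution of your data, so the best approach is to try them all, check the execution plans, and find the best indexes for each one. An index on (person_city_id, person_name) will most probably be good for any of these queries.
By distribution I mean:
Do you have few cities with many persons per city? (I would expect Options 2 and 4 to behave better in that case.)
Or many cities with few persons per city? (Option 3 may work better with such data.)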
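To compare the options on your own data, each query can be prefixed with EXPLAIN. The sketch below does this for Option 2; the same prefix works for the other options.

```sql
-- Shows the execution plan instead of running the query.
EXPLAIN
SELECT c.*, p.*
FROM city AS c
JOIN ( SELECT person_city_id,
              MAX(person_name) AS person_name
       FROM person
       GROUP BY person_city_id ) AS pmax
  ON c.city_id = pmax.person_city_id
JOIN person AS p
  ON p.person_city_id = pmax.person_city_id
 AND p.person_name = pmax.person_name;
```

In the output, the key column shows which index (if any) is used for each table, and the rows column gives the optimizer's estimate of how many rows it expects to examine.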