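Most efficient way to select one row per group in a one-to-many table pair in MySQL

Date: 2012-02-06 00:16:56

Tags: mysql performance group-by sql-order-by greatest-n-per-group

Suppose I have the following data in the one-to-many tables city and person: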

SELECT city.*, person.* FROM city, person WHERE city.city_id = person.person_city_id;
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name   | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
|       1 | chicago     |         1 | charles     |              1 |
|       1 | chicago     |         2 | celia       |              1 |
|       1 | chicago     |         3 | curtis      |              1 |
|       1 | chicago     |         4 | chauncey    |              1 |
|       2 | new york    |         5 | nathan      |              2 |
|       3 | los angeles |         6 | luke        |              3 |
|       3 | los angeles |         7 | louise      |              3 |
|       3 | los angeles |         8 | lucy        |              3 |
|       3 | los angeles |         9 | larry       |              3 |
+---------+-------------+-----------+-------------+----------------+
9 rows in set (0.00 sec)
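
To reproduce the result above, a minimal sketch of a schema and data set is given here. The column types, the keys, and the InnoDB engine are assumptions; only the column names and values come from the output shown.

```sql
-- Assumed schema: types, keys and engine are guesses, not part of the question.
CREATE TABLE city (
    city_id   INT         NOT NULL PRIMARY KEY,
    city_name VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE person (
    person_id      INT         NOT NULL PRIMARY KEY,
    person_name    VARCHAR(50) NOT NULL,
    person_city_id INT         NOT NULL,
    FOREIGN KEY (person_city_id) REFERENCES city (city_id)
) ENGINE=InnoDB;

-- Values copied from the result set above.
INSERT INTO city (city_id, city_name) VALUES
    (1, 'chicago'), (2, 'new york'), (3, 'los angeles');

INSERT INTO person (person_id, person_name, person_city_id) VALUES
    (1, 'charles', 1), (2, 'celia', 1), (3, 'curtis', 1), (4, 'chauncey', 1),
    (5, 'nathan', 2),
    (6, 'luke', 3), (7, 'louise', 3), (8, 'lucy', 3), (9, 'larry', 3);
```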

I want to select a single person record for each unique city, using some specific logic. For example:

SELECT city.*, person.* FROM city, person WHERE city.city_id = person.person_city_id
GROUP BY city_id ORDER BY person_name DESC
;

The implication here is that within each city I want the lexicographically greatest person_name, e.g.:

+---------+-------------+-----------+-------------+----------------+
| city_id | city_name   | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
|       2 | new york    |         5 | nathan      |              2 |
|       3 | los angeles |         6 | luke        |              3 |
|       1 | chicago     |         3 | curtis      |              1 |
+---------+-------------+-----------+-------------+----------------+

However, the actual output I get is:

+---------+-------------+-----------+-------------+----------------+
| city_id | city_name   | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
|       2 | new york    |         5 | nathan      |              2 |
|       3 | los angeles |         6 | luke        |              3 |
|       1 | chicago     |         1 | charles     |              1 |
+---------+-------------+-----------+-------------+----------------+

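As far as I can tell, the reason for this discrepancy is that MySQL performs the GROUP BY first and the ORDER BY afterwards. That is unfortunate for me, because I want the GROUP BY to apply my selection logic when it picks which record to keep.

I can work around this with some nested SELECT statements: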

SELECT c.*, p.* FROM city c,
    ( SELECT p_inner.* FROM
        ( SELECT * FROM person ORDER BY person_city_id, person_name DESC ) p_inner
        GROUP BY person_city_id ) p
    WHERE c.city_id = p.person_city_id;
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name   | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
|       1 | chicago     |         3 | curtis      |              1 |
|       2 | new york    |         5 | nathan      |              2 |
|       3 | los angeles |         6 | luke        |              3 |
+---------+-------------+-----------+-------------+----------------+

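This looks very inefficient once the person table grows arbitrarily large. I assume the inner SELECT statements are unaware of the outermost WHERE filter. Is that true?

What is the best way to effectively perform the ORDER BY before the GROUP BY?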

2 Answers:

Answer 0 (score: 1)

The usual way of doing this (in MySQL) is to join the table to itself.

First, get the greatest person_name per person_city_id (i.e. per city in the person table):

SELECT p.*
FROM person p
LEFT JOIN person p2
    ON  p.person_city_id = p2.person_city_id
    AND p.person_name < p2.person_name
WHERE p2.person_name IS NULL;

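This joins person to itself within each person_city_id (your GROUP BY variable), and pairs the rows up so that p2's person_name is greater than p's person_name.

Because it is a left join, if there is a p.person_name for which there is no greater p2.person_name (within the same city), then p2.person_name will be NULL. These are precisely the greatest person_name values per city.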

So to join your other information (from city) to it, just do one more join.
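
A sketch of what that extra join could look like, reusing the p and p2 aliases from the self-join query above. The alias c and the choice of an inner JOIN to city are assumptions made for illustration.

```sql
-- Sketch: self-join keeps the greatest person_name per city, then city is attached.
SELECT c.*, p.*
FROM person p
LEFT JOIN person p2
    ON  p.person_city_id = p2.person_city_id
    AND p.person_name < p2.person_name   -- p2 is someone "greater" in the same city
JOIN city c                              -- alias c is an assumption
    ON  c.city_id = p.person_city_id
WHERE p2.person_name IS NULL;            -- keep p only when no greater name exists
```

If two people in the same city share the greatest name, this form returns both of them.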

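Answer 1 (score: 0)

Your "solution" is not valid SQL, but it does work in MySQL. However, you cannot be sure it won't break with a future change in the query optimizer code. It can be slightly improved by using only one level of nesting (still not valid SQL):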

--- Option 1 ---
SELECT 
       c.*
     , p.* 
FROM 
      city AS c
  JOIN
      ( SELECT * 
        FROM person 
        ORDER BY person_city_id
               , person_name DESC 
      ) AS p
    ON  c.city_id = p.person_city_id
GROUP BY p.person_city_id

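Another way (valid SQL syntax, which also works in other DBMSs) is to write a subquery that selects the alphabetically last name for each city, and then join: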

--- Option 2 ---
SELECT 
       c.*
     , p.* 
FROM 
      city AS c
  JOIN
      ( SELECT person_city_id
             , MAX(person_name) AS person_name 
        FROM person 
        GROUP BY person_city_id
      ) AS pmax
    ON  c.city_id = pmax.person_city_id
  JOIN 
      person AS p
    ON  p.person_city_id = pmax.person_city_id
    AND p.person_name = pmax.person_name

Another way is a self-join (of table person) with the < trick, as described by @mathematical_coffee.

--- Option 3 ---
  see @mathematical-coffee's answer

Another way is to join city with person using a LIMIT 1 subquery:

--- Option 4 ---
SELECT 
       c.*
     , p.* 
FROM 
      city AS c
  JOIN
      person AS p
    ON
      p.person_id =
      ( SELECT person_id
        FROM person AS pm 
        WHERE pm.person_city_id = c.city_id
        ORDER BY person_name DESC
        LIMIT 1
      ) 

This runs a subquery (on table person) for every city. It will be efficient if you have an index on (person_city_id, person_name) for the InnoDB engine, or on (person_city_id, person_name, person_id) for the MyISAM engine.
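
As a rough sketch, such an index could be created as follows. The index names are hypothetical, and only one of the two variants would be needed depending on the storage engine.

```sql
-- InnoDB: secondary indexes implicitly include the primary key (person_id).
-- The index name idx_person_city_name is hypothetical.
CREATE INDEX idx_person_city_name
    ON person (person_city_id, person_name);

-- MyISAM variant: person_id is listed explicitly so the subquery
-- can be answered from the index alone.
-- CREATE INDEX idx_person_city_name_id
--     ON person (person_city_id, person_name, person_id);
```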


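There is one major difference between these options:

Options 2 and 3 return all tied results (if two or more people in the same city share the alphabetically last name, both or all of them are shown).

Options 1 and 4 return one result per city, even when there are ties. You can choose which one by changing the ORDER BY clause.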
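
For instance, Option 4 could be given an explicit tie-breaker along these lines. Using the lowest person_id as the tie-breaker is an assumption chosen purely for illustration; any deterministic column would do.

```sql
SELECT c.*, p.*
FROM city AS c
JOIN person AS p
    ON p.person_id =
       ( SELECT pm.person_id
         FROM person AS pm
         WHERE pm.person_city_id = c.city_id
         ORDER BY pm.person_name DESC
                , pm.person_id ASC   -- assumed tie-breaker: lowest person_id wins
         LIMIT 1
       );
```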


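Which option is more efficient also depends on the distribution of your data, so the best approach is to try them all, check the execution plans, and find the best indexes for each option. An index on (person_city_id, person_name) will most likely help all of these queries.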
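
One way to check an execution plan is to prefix a query with EXPLAIN. The sketch below does this for Option 2; the same prefix works for the other options. In the output, the key and rows columns are probably the most useful to compare.

```sql
-- EXPLAIN shows the plan without returning the query's rows.
EXPLAIN
SELECT c.*, p.*
FROM city AS c
JOIN ( SELECT person_city_id, MAX(person_name) AS person_name
       FROM person
       GROUP BY person_city_id ) AS pmax
    ON  c.city_id = pmax.person_city_id
JOIN person AS p
    ON  p.person_city_id = pmax.person_city_id
    AND p.person_name = pmax.person_name;
```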

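By distribution I mean:

  • Do you have few cities with many persons per city? (I think options 2 and 4 would perform better in that case.)

  • Or many cities with few persons per city? (Option 3 may work better with such data.)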
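
To get a feel for which case applies, a summary query along these lines might help. The column aliases are made up for this sketch.

```sql
-- Summarise how persons are spread across cities.
SELECT COUNT(*) AS cities_with_people,
       SUM(n)   AS total_people,
       AVG(n)   AS avg_people_per_city,
       MAX(n)   AS max_people_per_city
FROM ( SELECT person_city_id, COUNT(*) AS n
       FROM person
       GROUP BY person_city_id ) AS per_city;
```

A small cities_with_people with a large avg_people_per_city corresponds to the first case, the opposite to the second.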