This is a follow-up to my previous question:
Optimizing query to get entire row where one field is the maximum for a group
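I have changed the names used there to make them more memorable, but they do not represent my actual use case (so please don't estimate record counts from them).
I have a table with the following schema: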
OrderTime DATETIME(6),
Customer VARCHAR(50),
DrinkPrice DECIMAL,
Bartender VARCHAR(50),
TimeToPrepareDrink TIME(6),
...
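I want to extract from the table the rows that represent each customer's most expensive drink order during happy hour (3 PM to 6 PM) on each day. For example, I would like a result such as: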
Date | Customer | OrderTime | MaxPrice | Bartender | ...
-------+----------+-------------+------------+-----------+-----
1/1/18 | Alice | 1/1/18 3:45 | 13.15 | Jane | ...
1/1/18 | Bob | 1/1/18 5:12 | 9.08 | Jane | ...
1/1/18 | Carol | 1/1/18 4:45 | 20.00 | Tarzan | ...
1/2/18 | Alice | 1/2/18 3:45 | 13.15 | Jane | ...
1/2/18 | Bob | 1/2/18 5:57 | 6.00 | Tarzan | ...
1/2/18 | Carol | 1/2/18 3:13 | 6.00 | Tarzan | ...
...
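The table has an index on OrderTime and contains tens of billions of records. (My customers are heavy drinkers.)
Thanks to the previous question, I can easily extract this for one particular day. I can do something like: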
SELECT * FROM orders b
INNER JOIN (
SELECT Customer, MAX(DrinkPrice) as MaxPrice
FROM orders
WHERE OrderTime >= '2018-01-01 15:00'
AND OrderTime <= '2018-01-01 18:00'
GROUP BY Customer
) AS a
ON a.Customer = b.Customer
AND a.MaxPrice = b.DrinkPrice
WHERE b.OrderTime >= '2018-01-01 15:00'
AND b.OrderTime <= '2018-01-01 18:00';
This query runs in under a second. Its explain plan looks like this:
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| id| select_type | table | type | possible_keys | key | ref | Extra |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| 1 | PRIMARY | b | range | OrderTime | OrderTime | NULL | Using index condition |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | b.Customer,b.Price | |
| 2 | DERIVED | orders | range | OrderTime | OrderTime | NULL | Using index condition; Using temporary; Using filesort |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
I can also get the maximum price per customer for each of the days my query cares about:
SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
orders
INNER JOIN
(SELECT '2018-01-01' AS Date
UNION
SELECT '2018-01-02' AS Date) dates
WHERE OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
HAVING MaxPrice > 0;
This query also runs in under a second. Here is what its explain plan looks like:
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | Using temporary; Using filesort |
| 1 | PRIMARY | orders | ALL | OrderTime | NULL | NULL | Range checked for each record (index map: 0x1) |
| 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | UNION | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
The problem now is retrieving the remaining fields from the table. I tried adapting the same trick as before:
SELECT * FROM
orders a
INNER JOIN
(SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
orders
INNER JOIN
(SELECT '2018-01-01' AS Date
UNION
SELECT '2018-01-02' AS Date) dates
WHERE OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
HAVING MaxPrice > 0) b
ON a.OrderTime >= TIMESTAMP(b.Date, '15:00:00')
AND a.OrderTime <= TIMESTAMP(b.Date, '18:00:00')
AND a.Customer = b.Customer;
However, for reasons I don't understand, the database chooses to execute this in a way that takes forever. The explain plan:
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| 1 | PRIMARY | a | ALL | OrderTime | NULL | NULL | |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | a.Customer | Using where |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | Using temporary; Using filesort |
| 2 | DERIVED | orders | ALL | OrderTime | NULL | NULL | Range checked for each record (index map: 0x1) |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | No tables used |
| 4 | UNION | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union3,4> | ALL | NULL | NULL | NULL | |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
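Question:
Answer 0 (score: 0)
This task looks like a "groupwise-max" problem. Here is one approach that involves only 2 "queries" (the inner query is called a "derived table").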
SELECT x.OrderDate, x.Customer, b.OrderTime,
x.MaxPrice, b.Bartender
FROM
(
SELECT DATE(OrderTime) AS OrderDate,
Customer,
Max(Price) AS MaxPrice
FROM tbl
WHERE TIME(OrderTime) BETWEEN '15:00' AND '18:00'
GROUP BY OrderDate, Customer
) AS x
JOIN tbl AS b
         ON DATE(b.OrderTime) = x.OrderDate
AND b.customer = x.Customer
AND b.Price = x.MaxPrice
WHERE TIME(b.OrderTime) BETWEEN '15:00' AND '18:00'
ORDER BY x.OrderDate, x.Customer
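The derived-table join can be tried out on a handful of rows before pointing it at the real table. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL/MariaDB, the sample rows are invented, the table and column names (orders, DrinkPrice) are taken from the question rather than from the answer's tbl and Price, and the date match is written as DATE(b.OrderTime) because the base table has no OrderDate column.

```python
import sqlite3

# In-memory SQLite database as a small stand-in for the real MySQL/MariaDB table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    OrderTime  TEXT,   -- ISO-8601 text stands in for DATETIME(6)
    Customer   TEXT,
    DrinkPrice REAL,
    Bartender  TEXT
);
-- Invented sample rows, for illustration only.
INSERT INTO orders VALUES
    ('2018-01-01 15:45:00', 'Alice', 13.15, 'Jane'),
    ('2018-01-01 16:10:00', 'Alice',  7.00, 'Tarzan'),
    ('2018-01-01 20:00:00', 'Alice', 30.00, 'Jane'),    -- outside happy hour
    ('2018-01-01 17:12:00', 'Bob',    9.08, 'Jane'),
    ('2018-01-02 15:13:00', 'Carol',  6.00, 'Tarzan');
""")

# Inner query: max price per (day, customer) within happy hour.
# Outer join: fetch the full row(s) that carry that max price.
query = """
SELECT x.OrderDate, x.Customer, b.OrderTime, x.MaxPrice, b.Bartender
FROM (
    SELECT DATE(OrderTime) AS OrderDate, Customer, MAX(DrinkPrice) AS MaxPrice
    FROM orders
    WHERE TIME(OrderTime) BETWEEN '15:00:00' AND '18:00:00'
    GROUP BY DATE(OrderTime), Customer
) AS x
JOIN orders AS b
  ON DATE(b.OrderTime) = x.OrderDate
 AND b.Customer = x.Customer
 AND b.DrinkPrice = x.MaxPrice
WHERE TIME(b.OrderTime) BETWEEN '15:00:00' AND '18:00:00'
ORDER BY x.OrderDate, x.Customer
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that if a customer has two happy-hour orders tied at the maximum price on the same day, this join returns both rows.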
Ideal index:
INDEX(Customer, Price)
(There is no good reason to use MyISAM.)
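Billions of new rows per day
This adds a new wrinkle. Will you need more than 1TB of additional disk space every day?
Is it possible to summarize the data? The goal here is to add the summary information as the new data comes in, and never have to re-scan the billions of old rows. With summary tables, you may also be able to drop all the secondary indexes on the fact table.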
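One possible shape for such a summary table is sketched below. This is an assumption-laden illustration, not the method from the linked documents: the table name happy_hour_max, its columns, and the add_batch helper are hypothetical, and SQLite's ON CONFLICT ... DO UPDATE upsert is used as a stand-in for what would roughly be INSERT ... ON DUPLICATE KEY UPDATE in MySQL/MariaDB. Each incoming batch is folded into one row per day and customer, so old fact rows are never re-read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE happy_hour_max (        -- hypothetical summary table
    OrderDate TEXT,
    Customer  TEXT,
    OrderTime TEXT,
    MaxPrice  REAL,
    Bartender TEXT,
    PRIMARY KEY (OrderDate, Customer)
)""")

# Insert a new (day, customer) row, or overwrite the stored one only
# when the incoming order is more expensive than the current maximum.
UPSERT = """
INSERT INTO happy_hour_max (OrderDate, Customer, OrderTime, MaxPrice, Bartender)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(OrderDate, Customer) DO UPDATE SET
    OrderTime = excluded.OrderTime,
    MaxPrice  = excluded.MaxPrice,
    Bartender = excluded.Bartender
WHERE excluded.MaxPrice > happy_hour_max.MaxPrice
"""

def add_batch(conn, batch):
    """Fold new (OrderTime, Customer, DrinkPrice, Bartender) rows into the summary."""
    for order_time, customer, price, bartender in batch:
        day, clock = order_time.split(" ")
        if "15:00:00" <= clock <= "18:00:00":   # keep happy-hour orders only
            conn.execute(UPSERT, (day, customer, order_time, price, bartender))
    conn.commit()

# Two invented batches arriving one after the other.
add_batch(conn, [("2018-01-01 15:45:00", "Alice", 13.15, "Jane"),
                 ("2018-01-01 19:00:00", "Alice", 50.00, "Jane")])   # outside happy hour
add_batch(conn, [("2018-01-01 16:30:00", "Alice", 20.00, "Tarzan"),
                 ("2018-01-01 17:00:00", "Alice",  5.00, "Jane")])

summary = conn.execute("SELECT * FROM happy_hour_max").fetchall()
print(summary)
```

With a table like this, the original report becomes a plain range scan over the summary rather than a groupwise-max over the fact table.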
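Normalization would help shrink the table, and hence speed up queries. Bartender and Customer are prime candidates for this: the former could be a SMALLINT UNSIGNED (2 bytes; 65K values), the latter a MEDIUMINT UNSIGNED (3 bytes; 16M values). That would probably shrink the 5 columns you currently show by more than 50%. After normalizing, you may see a 2x speedup on many operations.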
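Normalization is best done by "staging" the data: load the data into a temporary table, normalize it there, summarize it, then copy it into the main fact table.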
See http://mysql.rjweb.org/doc.php/summarytables
and http://mysql.rjweb.org/doc.php/staging_table
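The staging flow described above might look roughly like the following. This is a minimal sketch under assumptions, not code from the linked pages: the table names (staging, customers, bartenders, fact), the load_batch helper, and the sample rows are all hypothetical, and SQLite stands in for MySQL/MariaDB, so the SMALLINT/MEDIUMINT sizing is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical lookup tables: each distinct name is stored once and gets a small id.
CREATE TABLE customers  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE bartenders (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
-- Raw rows land here first, with names still spelled out.
CREATE TABLE staging (OrderTime TEXT, Customer TEXT, DrinkPrice REAL, Bartender TEXT);
-- The fact table stores only the ids.
CREATE TABLE fact (OrderTime TEXT, CustomerId INTEGER, DrinkPrice REAL, BartenderId INTEGER);
""")

def load_batch(conn, batch):
    """Stage a batch, normalize the names to ids, copy to the fact table, clear staging."""
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", batch)
    conn.executescript("""
    -- Add any names not seen before; known names are skipped.
    INSERT OR IGNORE INTO customers (name)  SELECT DISTINCT Customer  FROM staging;
    INSERT OR IGNORE INTO bartenders (name) SELECT DISTINCT Bartender FROM staging;
    -- Copy the batch into the fact table with names replaced by ids.
    INSERT INTO fact
        SELECT s.OrderTime, c.id, s.DrinkPrice, t.id
        FROM staging s
        JOIN customers  c ON c.name = s.Customer
        JOIN bartenders t ON t.name = s.Bartender;
    DELETE FROM staging;
    """)

# Invented sample batches.
load_batch(conn, [("2018-01-01 15:45:00", "Alice", 13.15, "Jane"),
                  ("2018-01-01 17:12:00", "Bob",    9.08, "Jane")])
load_batch(conn, [("2018-01-02 15:45:00", "Alice", 13.15, "Tarzan")])

fact_count = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(fact_count, customer_count)
```

A summarization step, if used, would slot in between the normalization and the final DELETE, while the batch is still small.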
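Before getting back to the question of optimizing this one query, we need to see the schema, the data flow, whether things can be normalized, whether summary tables are practical, and so on. I would hope to end up with an "answer" in which the query does most of its work against a summary table. Sometimes that gives a 10x speedup.
Answer 1 (score: 0)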
To extract the rows representing each customer's most expensive drink order during happy hour (3 PM to 6 PM) on each day, I would use row_number() over() inside a case expression that evaluates the hour of day. Sample table and data first (note the changes made to OrderTime):
CREATE TABLE mytable(
   Date      DATE
  ,Customer  VARCHAR(10)
  ,OrderTime DATETIME
  ,MaxPrice  NUMERIC(12,2)
  ,Bartender VARCHAR(11)
);
INSERT INTO mytable(Date,Customer,OrderTime,MaxPrice,Bartender) VALUES
    ('1/1/18','Alice','1/1/18 13:45',13.15,'Jane')
  , ('1/1/18','Bob'  ,'1/1/18 15:12', 9.08,'Jane')
  , ('1/2/18','Alice','1/2/18 13:45',13.15,'Jane')
  , ('1/2/18','Bob'  ,'1/2/18 15:57', 6.00,'Tarzan')
  , ('1/2/18','Carol','1/2/18 13:13', 6.00,'Tarzan')
;
The suggested query is this:
select *
from (
    select *
         , case when hour(OrderTime) between 15 and 18
                then row_number() over(partition by `Date`, customer order by MaxPrice DESC)
                else null
           end rn
    from mytable
    ) d
where rn = 1
;
The result gives you access to all the columns you include in the derived table:
Date       | Customer | OrderTime           | MaxPrice | Bartender | rn
:--------- | :------- | :------------------ | -------: | :-------- | -:
0001-01-18 | Bob      | 0001-01-18 15:12:00 |     9.08 | Jane      |  1
0001-02-18 | Bob      | 0001-02-18 15:57:00 |     6.00 | Tarzan    |  1
To help show how this works, running just the derived table subquery:
select *
     , case when hour(OrderTime) between 15 and 18
            then row_number() over(partition by `Date`, customer order by MaxPrice DESC)
            else null
       end rn
from mytable
;
produces this interim result set:
Date       | Customer | OrderTime           | MaxPrice | Bartender |   rn
:--------- | :------- | :------------------ | -------: | :-------- | ---:
0001-01-18 | Alice    | 0001-01-18 13:45:00 |    13.15 | Jane      | null
0001-01-18 | Bob      | 0001-01-18 15:12:00 |     9.08 | Jane      |    1
0001-02-18 | Alice    | 0001-02-18 13:45:00 |    13.15 | Jane      | null
0001-02-18 | Bob      | 0001-02-18 15:57:00 |     6.00 | Tarzan    |    1
0001-02-18 | Carol    | 0001-02-18 13:13:00 |     6.00 | Tarzan    | null
db<>fiddle here
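One thing to watch with ranking inside a case expression: row_number() still numbers every row in the partition, including orders outside happy hour, so a pricier order placed earlier in the day can take rn = 1 and hide the happy-hour maximum. The variant below is a sketch of a different arrangement, not the query from this answer: it filters to the happy-hour window first and ranks only the surviving rows. It runs on Python's built-in sqlite3 module (window functions need SQLite 3.25 or later) as a stand-in for MySQL 8 / MariaDB; the sample rows and the Price column name are invented, ISO dates are used, and the day is derived from OrderTime rather than stored in a separate Date column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (Customer TEXT, OrderTime TEXT, Price REAL, Bartender TEXT);
-- Invented sample rows, for illustration only.
INSERT INTO mytable VALUES
    ('Alice', '2018-01-01 13:45:00', 13.15, 'Jane'),    -- before happy hour
    ('Bob',   '2018-01-01 15:12:00',  9.08, 'Jane'),
    ('Bob',   '2018-01-01 14:00:00', 25.00, 'Tarzan'),  -- pricier, but before happy hour
    ('Bob',   '2018-01-02 15:57:00',  6.00, 'Tarzan');
""")

# Filter to the happy-hour window first, then rank within (day, customer),
# so rows outside the window never compete for rn = 1.
query = """
SELECT OrderDate, Customer, OrderTime, Price, Bartender
FROM (
    SELECT DATE(OrderTime) AS OrderDate, Customer, OrderTime, Price, Bartender,
           ROW_NUMBER() OVER (PARTITION BY DATE(OrderTime), Customer
                              ORDER BY Price DESC) AS rn
    FROM mytable
    WHERE TIME(OrderTime) BETWEEN '15:00:00' AND '18:00:00'
) AS d
WHERE rn = 1
ORDER BY OrderDate, Customer
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike the join on the maximum price, ROW_NUMBER() returns exactly one row per day and customer, so ties at the maximum price are broken arbitrarily unless more columns are added to the ORDER BY.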