I have the following query:
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
The analytics table has 60 million rows and the transactions table has 3 million rows. Running EXPLAIN on this query gives me:
+----+-------------+--------------+------+------------------------------------+-------------------+---------+--------------------------+--------+----------------------------------------------+
| id | select_type | table        | type | possible_keys                      | key               | key_len | ref                      | rows   | Extra                                        |
+----+-------------+--------------+------+------------------------------------+-------------------+---------+--------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | analytics    | ref  | analytics_user_id,analytics_source | analytics_user_id | 5       | const                    | 337662 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | transactions | ref  | tran_analytics                     | tran_analytics    | 5       | dijishop2.analytics.id   | 1      | NULL                                         |
+----+-------------+--------------+------+------------------------------------+-------------------+---------+--------------------------+--------+----------------------------------------------+
I don't see how to optimize this query further, since it is already very basic. It takes around 70 seconds to run.
Here are the indexes that exist:
+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| # Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| 'analytics' | '0' | 'PRIMARY' | '1' | 'id' | 'A' | '56934235' | NULL | NULL | '' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_user_id' | '1' | 'user_id' | 'A' | '130583' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_product_id' | '1' | 'product_id' | 'A' | '490812' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_affil_user_id' | '1' | 'affil_user_id' | 'A' | '55222' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_source' | '1' | 'source' | 'A' | '24604' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_country_name' | '1' | 'country_name' | 'A' | '39510' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_gordon' | '1' | 'id' | 'A' | '56934235' | NULL | NULL | '' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_gordon' | '2' | 'user_id' | 'A' | '56934235' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_gordon' | '3' | 'source' | 'A' | '56934235' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| # Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| 'transactions' | '0' | 'PRIMARY' | '1' | 'id' | 'A' | '2436151' | NULL | NULL | '' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'tran_user_id' | '1' | 'user_id' | 'A' | '56654' | NULL | NULL | '' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'transaction_id' | '1' | 'transaction_id' | 'A' | '2436151' | '191' | NULL | 'YES' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'tran_analytics' | '1' | 'analytics' | 'A' | '2436151' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'tran_status' | '1' | 'status' | 'A' | '22' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'gordon_trans' | '1' | 'status' | 'A' | '22' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'gordon_trans' | '2' | 'analytics' | 'A' | '2436151' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
Here is the simplified schema of the two tables, before any of the suggested extra indexes were added (adding them did not improve the situation):
CREATE TABLE `analytics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`affil_user_id` int(11) DEFAULT NULL,
`product_id` int(11) DEFAULT NULL,
`medium` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`terms` varchar(1024) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`is_browser` tinyint(1) DEFAULT NULL,
`is_mobile` tinyint(1) DEFAULT NULL,
`is_robot` tinyint(1) DEFAULT NULL,
`browser` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`mobile` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`robot` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`platform` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`referrer` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`domain` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`ip` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`continent_code` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`country_name` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`city` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `analytics_user_id` (`user_id`),
KEY `analytics_product_id` (`product_id`),
KEY `analytics_affil_user_id` (`affil_user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=64821325 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
CREATE TABLE `transactions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`transaction_id` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`user_id` int(11) NOT NULL,
`pay_key` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`sender_email` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`amount` decimal(10,2) DEFAULT NULL,
`currency` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`analytics` int(11) DEFAULT NULL,
`ip_address` varchar(46) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`session_id` varchar(60) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`eu_vat_applied` int(1) DEFAULT '0',
PRIMARY KEY (`id`),
KEY `tran_user_id` (`user_id`),
KEY `transaction_id` (`transaction_id`(191)),
KEY `tran_analytics` (`analytics`),
KEY `tran_status` (`status`)
) ENGINE=InnoDB AUTO_INCREMENT=10019356 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
If the above cannot be optimized any further, any implementation advice on summary tables would be very welcome. We are using a LAMP stack on AWS; the query above runs on RDS (m1.large).
Answer 0 (score: 10)
I would create the following indexes (B-tree indexes):
analytics(user_id, source, id)
transactions(analytics, status)
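In DDL form (the index names here are my own invention; any names will do), that would be something like:

```sql
CREATE INDEX analytics_user_source_id ON analytics (user_id, source, id);
CREATE INDEX tran_analytics_status    ON transactions (analytics, status);
```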
This is different from Gordon's suggestion.
The order of columns in an index matters.
You filter by a specific analytics.user_id, so that field must be the first one in the index.
Then you group by analytics.source; to avoid sorting by source, it should be the next field of the index. You also reference analytics.id, so it is better to have that field as part of the index as well; put it last. Is MySQL able to read just the index without touching the table? I don't know, but it is easy enough to test.
The index on transactions must start with analytics, because it will be used in the JOIN. We also need status.
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
Answer 1 (score: 7)
First, some analysis...
SELECT a.source AS referrer,
COUNT(*) AS frequency, -- See question below
SUM(t.status = 'COMPLETED') AS sales
FROM analytics AS a
LEFT JOIN transactions AS t ON a.id = t.analytics
WHERE a.user_id = 52094
GROUP BY a.source
ORDER BY frequency DESC
LIMIT 10
If the mapping from a to t is "one-to-many", you need to consider whether COUNT and SUM have the correct values or inflated ones. As the query stands, they are "inflated": the JOIN occurs before the aggregation, so you are counting the number of transactions and how many of them were completed. I will assume that is desired.
Note: the usual pattern is COUNT(*); saying COUNT(x) implies checking x for NULL. I suspect that check is not needed here?
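As a throwaway illustration (not data from the question), the difference shows up as soon as a NULL is involved:

```sql
-- COUNT(*) counts rows; COUNT(x) counts only rows where x IS NOT NULL.
SELECT COUNT(*) AS all_rows,   -- 2
       COUNT(x) AS non_null_x  -- 1
FROM (SELECT NULL AS x UNION ALL SELECT 1) AS d;
```

With the LEFT JOIN in the question, COUNT(analytics.id) and COUNT(*) behave the same anyway, because analytics.id comes from the left table and is never NULL.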
This index handles the WHERE and is "covering":
analytics: INDEX(user_id, source, id) -- user_id first
transactions: INDEX(analytics, status) -- in this order
The GROUP BY may or may not require a sort. The ORDER BY, being different from the GROUP BY, definitely requires one, and the entire set of grouped rows has to be sorted; the LIMIT offers no shortcut.
Normally, summary tables are date-oriented. That is, the PRIMARY KEY includes a date plus some other dimensions. Perhaps keying by date and user_id would make sense here? How many transactions does the average user have per day? If at least 10, then consider a summary table. Also, it is important not to be UPDATEing or DELETEing old records. More
I would probably have:
user_id ...,
source ...,
dy DATE ...,
status ...,
freq MEDIUMINT UNSIGNED NOT NULL,
status_ct MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(user_id, status, source, dy)
Then the query becomes:
SELECT source AS referrer,
SUM(freq) AS frequency,
SUM(status_ct) AS completed_sales
FROM Summary
WHERE user_id = 52094
AND status = 'COMPLETED'
GROUP BY source
ORDER BY frequency DESC
LIMIT 10
The speed comes from many factors:
- a smaller table,
- no JOIN,
- a more useful index.
(It still needs an extra sort for the ORDER BY.)
Even without the summary table, there may be some speedups:
- Normalizing some of the strings that are both bulky and repetitive could make that table less I/O-bound.
- KEY (transaction_id(191)) needs its prefix because the column exceeds the index size limit; see here for 5 ways to work around that.
- The ip column does not need varchar(255) with utf8mb4_unicode_ci; varchar(39) and ascii are sufficient.

Answer 2 (score: 6)
For this query:
SELECT a.source AS referrer,
COUNT(*) AS frequency,
SUM( t.status = 'COMPLETED' ) AS sales
FROM analytics a LEFT JOIN
transactions t
ON a.id = t.analytics
WHERE a.user_id = 52094
GROUP BY a.source
ORDER BY frequency DESC
LIMIT 10 ;
you want indexes on analytics(user_id, id, source) and transactions(analytics, status).
Answer 3 (score: 4)
Try the following and let me know if it helps.
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM (SELECT * FROM analytics where user_id = 52094) analytics
LEFT JOIN (SELECT analytics, status FROM transactions) transactions ON analytics.id = transactions.analytics
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
Answer 4 (score: 3)
Could you try the following approach:
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(sales) AS sales
FROM analytics
LEFT JOIN(
SELECT transactions.Analytics, (CASE WHEN transactions.status = 'COMPLETED' THEN 1 ELSE 0 END) AS sales
FROM analytics INNER JOIN transactions ON analytics.id = transactions.analytics
) Tra
ON analytics.id = Tra.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
Answer 5 (score: 3)
This query potentially joins millions of analytics records with transactions records and calculates the sum (including the status check) over millions of records. If we could first apply the LIMIT 10 and only then do the join and calculate the sum, we could speed the query up.
Unfortunately, we need analytics.id for the join, and it is lost once the GROUP BY is applied. But maybe analytics.source is selective enough to boost the query anyway.
My idea is therefore to calculate the frequencies, limit them, return analytics.source and frequency in a subquery, and use that result to filter analytics in the main query, which then does the rest of the joins and calculations on a hopefully much smaller number of records.
The minimal subquery (note: no join, no sum, returns 10 records):
SELECT
source,
COUNT(id) AS frequency
FROM analytics
WHERE user_id = 52094
GROUP BY source
ORDER BY frequency DESC
LIMIT 10
The full query, using the above query as the subquery x:
SELECT
x.source AS referrer,
x.frequency,
SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM
(<subquery here>) x
INNER JOIN analytics a
ON x.source = a.source -- This reduces the number of records
LEFT JOIN transactions t
ON a.id = t.analytics
WHERE a.user_id = 52094 -- We could have several users per source
GROUP BY x.source, x.frequency
ORDER BY x.frequency DESC
If this does not bring the expected performance boost, it may be because MySQL applies the joins in an unexpected order. As explained in "Is there a way to force MySQL execution order?", you can replace the joins with STRAIGHT_JOIN in that case.
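As a sketch (untested; the subquery placeholder is kept from above, and STRAIGHT_JOIN forces the derived table to be read before analytics):

```sql
SELECT
    x.source AS referrer,
    x.frequency,
    SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM
    (<subquery here>) x
STRAIGHT_JOIN analytics a
    ON x.source = a.source
LEFT JOIN transactions t
    ON a.id = t.analytics
WHERE a.user_id = 52094
GROUP BY x.source, x.frequency
ORDER BY x.frequency DESC
```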
Answer 6 (score: 2)
I would try a subquery:
SELECT a.source AS referrer,
COUNT(*) AS frequency,
SUM((SELECT COUNT(*) FROM transactions t
WHERE a.id = t.analytics AND t.status = 'COMPLETED')) AS sales
FROM analytics a
WHERE a.user_id = 52094
GROUP BY a.source
ORDER BY frequency DESC
LIMIT 10;
Plus, indexes exactly as in @Gordon's answer: analytics(user_id, id, source) and transactions(analytics, status).
Answer 7 (score: 2)
The only problem I see in your query is
GROUP BY analytics.source
ORDER BY frequency DESC
because this part makes the query use a temporary table with a filesort.
One way to avoid that is to create another table, for example:
CREATE TABLE `analytics_aggr` (
`source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`frequency` int(10) DEFAULT NULL,
`sales` int(10) DEFAULT NULL,
KEY `sales` (`sales`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Populate analytics_aggr using the query below:
insert into analytics_aggr SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY null
Now you can get the data easily with:
select * from analytics_aggr order by sales desc
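To keep analytics_aggr fresh, one option is to schedule the insert with the MySQL event scheduler. This is only a sketch; the event name is my own, event_scheduler must be ON, and in the mysql CLI the BEGIN ... END body needs DELIMITER handling:

```sql
CREATE EVENT refresh_analytics_aggr
ON SCHEDULE EVERY 1 HOUR
DO
BEGIN
    TRUNCATE TABLE analytics_aggr;
    INSERT INTO analytics_aggr
    SELECT analytics.source,
           COUNT(analytics.id),
           SUM(IF(transactions.status = 'COMPLETED', 1, 0))
    FROM analytics
    LEFT JOIN transactions ON analytics.id = transactions.analytics
    WHERE analytics.user_id = 52094
    GROUP BY analytics.source;
END
```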
Answer 8 (score: 2)
Try this:
SELECT
a.source AS referrer,
COUNT(a.id) AS frequency,
SUM(t.sales) AS sales
FROM (Select id, source From analytics Where user_id = 52094) a
LEFT JOIN (Select analytics, case when status = 'COMPLETED' Then 1 else 0 end as sales
From transactions) t ON a.id = t.analytics
GROUP BY a.source
ORDER BY frequency DESC
LIMIT 10
I am suggesting this because you said "they are huge tables", yet this SQL uses only a very few columns. In such a case, using inline views with only the required columns helps.
Note: memory also plays an important role here, so check the available memory before settling on the inline views.
Answer 9 (score: 2)
I would try to separate the querying of the two tables. Since you need only the top 10 sources, I would get them first, and only then query the sales column from transactions:

SELECT source as referrer
      ,frequency
      ,(select count(*)
        from transactions t
        where t.analytics in (select distinct id
                              from analytics
                              where user_id = 52094
                              and source = by_frequency.source)
          and status = 'completed'
       ) as sales
from (SELECT analytics.source
            ,count(*) as frequency
      from analytics
      where analytics.user_id = 52094
      group by analytics.source
      order by frequency desc
      limit 10
     ) by_frequency

It may also be faster without the DISTINCT.

Answer 10 (score: 2)
I am assuming the predicate user_id = 52094 is for illustration and that, in the application, the selected user_id is a variable.
I also assume that ACID properties are not very important here.
(1) So, I would maintain two replica tables with only the necessary fields, kept up to date via a utility table (this is similar to the indexes Vladimir suggested above).
CREATE TABLE mv_anal (
`id` int(11) NOT NULL,
`user_id` int(11) DEFAULT NULL,
`source` varchar(45),
PRIMARY KEY (`id`)
);
CREATE TABLE mv_trans (
`id` int(11) NOT NULL,
`status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`analytics` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE util (
last_updated_anal int (11) NOT NULL,
last_updated_trans int (11) NOT NULL
);
INSERT INTO util VALUES (0, 0);
The benefit of this is that we will be reading relatively small projections of the original tables; with luck, OS-level and DB-level caching kicks in and the reads come from faster RAM rather than slower secondary storage. That is a very large gain.
Here is how I would keep the two tables updated (the transaction below is run by cron):
-- TRANSACTION STARTS --
INSERT INTO mv_trans
SELECT id, IF(status = 'COMPLETED', 1, 0) AS status, analytics
FROM transactions JOIN util
    ON transactions.id > util.last_updated_trans;

UPDATE util
SET last_updated_trans = (SELECT MAX(id) FROM mv_trans);
-- TRANSACTION COMMITS --
-- similar transaction for mv_anal.
(2) Now, I would address selectivity to reduce sequential scan time. I would build a B-tree index on user_id, source, and id (in this order) on mv_anal.
Note: the above could be achieved by just creating an index on the analytics table, but building that index would require reading the big 60-million-row table. My approach requires the index build to read only a very thin table, so we can rebuild the B-tree more frequently (to counter the skew problem, since the table is append-only).
This is how I would ensure high selectivity at query time and counter the skewed-B-tree problem.
(3) In PostgreSQL, WITH subqueries are always materialized. I hope the same holds for MySQL (note that MySQL supports WITH only since 8.0). So, as the last mile of optimization:
WITH sub_anal AS (
    SELECT user_id, source AS referrer, COUNT(id) AS frequency
    FROM mv_anal
    WHERE user_id = 52094
    GROUP BY user_id, source
    ORDER BY COUNT(id) DESC
    LIMIT 10
)
SELECT sa.referrer, sa.frequency, SUM(trans.status) AS sales
FROM sub_anal AS sa
JOIN mv_anal AS anal
    ON sa.referrer = anal.source AND sa.user_id = anal.user_id
JOIN mv_trans AS trans
    ON anal.id = trans.analytics
GROUP BY sa.referrer, sa.frequency
Answer 11 (score: 1)
Late to the party. I think you need to get one of the indexes loaded into MySQL's cache; the NLJ is probably killing performance. Here is how I see it:
The Path
Your query is simple. It has two tables, and the "path" is very clear:
- First, the analytics table should be read.
- Then, the transactions table. This is because you are using a LEFT OUTER JOIN; there is not much to discuss here.
- The analytics table has 60 million rows, and the best path should filter those rows as early as possible.

Access
Once the path is clear, you need to decide whether to use index access or table access. Both have pros and cons. However, since you want to improve the performance of the SELECT, index access is the way to go here.

Filtering
Again, you want high performance for the SELECT. Therefore, the filtering should be resolved at the index level, not at the table level.
Aggregating Rows
After filtering, the next step is to aggregate the rows according to GROUP BY analytics.source. This can be improved by making the source column part of the index.
The Best Indexes for the Path, Access, Filtering, and Aggregation
Considering all of the above, you should include all the mentioned columns in indexes. The following indexes should improve the response time:
create index ix1_analytics on analytics (user_id, source, id);
create index ix2_transactions on transactions (analytics, status);
These indexes fulfill the "Path", "Access", and "Filtering" strategies described above.
Index Caching
Finally, and this is critical, get the secondary index loaded into MySQL's memory cache. MySQL is performing an NLJ (nested loop join), a "ref" in MySQL terminology, and needs to access the second table randomly close to 200k times. Unfortunately, I don't know a reliable way to preload an index into MySQL's cache. FORCE may work, as in:
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions FORCE index (ix2_transactions)
ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
Make sure you have enough cache space. Here is a short question and answer for reference: How to figure out if mysql index fits entirely in memory.
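One way to estimate how much buffer pool the indexes would need (my own sketch, not from the original answer; innodb_index_stats reports its "size" stat in pages, so multiply by the page size):

```sql
SELECT table_name, index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024, 1) AS size_mb
FROM mysql.innodb_index_stats
WHERE stat_name = 'size'
  AND table_name IN ('analytics', 'transactions');
```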
Good luck! Oh, and do post the results.
Answer 12 (score: 1)
This question has certainly received a lot of attention, so I'm sure all the obvious solutions have been tried. I did not, however, see anything that addresses the LEFT JOIN in the query.
I have noticed that LEFT JOIN statements often force query planners into hash joins, which are fast for small result sets but terribly slow for large ones. As noted in @Rick James' answer, since the join in the original query is on the identity field analytics.id, it generates a large result set, and the hash join yields terrible performance. The suggestion below addresses this without any schema or processing changes.
Since the aggregation is by analytics.source, I would try a query that creates separate aggregations for frequency by source and sales by source, and defers the left join until after the aggregation is complete. This should let the indexes be used optimally (typically a merge join for large data sets).
Here is my suggestion:
SELECT t1.source AS referrer, t1.frequency, t2.sales
FROM (
-- Frequency by source
SELECT a.source, COUNT(a.id) AS frequency
FROM analytics a
WHERE a.user_id=52094
GROUP BY a.source
) t1
LEFT JOIN (
-- Sales by source
SELECT a.source,
SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics a
JOIN transactions t
    ON a.id = t.analytics
AND t.status = 'COMPLETED'
AND a.user_id=52094
GROUP by a.source
) t2
ON t1.source = t2.source
ORDER BY frequency DESC
LIMIT 10
Hope this helps.