How to optimize COUNT and ORDER BY queries over millions of rows

Asked: 2018-06-08 12:44:50

Tags: mysql database database-administration

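I need help optimizing ORDER BY and COUNT queries. My tables have millions (roughly 3 million) of rows.

I have to join 4 tables and fetch the records. When I run the simple query it takes only milliseconds to complete, but when I try to count or order while left-joining the tables, it hangs indefinitely.

Please see the cases below.

DB server configuration: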

CPU Number of virtual cores: 4
Memory(RAM): 16 GiB
Network Performance: High

Rows in each table:

tbl_customers - #Rows: 20 million
tbl_customers_address - #Rows: 25 million
tbl_shop_setting - #Rows: 50k
aio_customer_tracking - #Rows: 5k

Table schema:

CREATE TABLE `tbl_customers` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(225) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `accepts_marketing` TINYINT(1) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_order_id` BIGINT(20) NULL DEFAULT NULL,
    `total_spent` DECIMAL(12,2) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `verified_email` TINYINT(4) NULL DEFAULT NULL,
    `updated_at` DATETIME NULL DEFAULT NULL,
    `created_at` DATETIME NULL DEFAULT NULL,
    `date_updated` DATETIME NULL DEFAULT NULL,
    `date_created` DATETIME NULL DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `shop_id` (`shop_id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `tbl_customers_address` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `customer_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_address_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_customer_id` BIGINT(20) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL,
    `last_name` VARCHAR(50) NULL DEFAULT NULL,
    `company` VARCHAR(50) NULL DEFAULT NULL,
    `address1` VARCHAR(250) NULL DEFAULT NULL,
    `address2` VARCHAR(250) NULL DEFAULT NULL,
    `city` VARCHAR(50) NULL DEFAULT NULL,
    `province` VARCHAR(50) NULL DEFAULT NULL,
    `country` VARCHAR(50) NULL DEFAULT NULL,
    `zip` VARCHAR(15) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL,
    `name` VARCHAR(50) NULL DEFAULT NULL,
    `province_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_name` VARCHAR(50) NULL DEFAULT NULL,
    `longitude` VARCHAR(250) NULL DEFAULT NULL,
    `latitude` VARCHAR(250) NULL DEFAULT NULL,
    `default` TINYINT(1) NULL DEFAULT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `customer_id` (`customer_id`),
    INDEX `shopify_address_id` (`shopify_address_id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

CREATE TABLE `tbl_shop_setting` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `shop_name` VARCHAR(300) NOT NULL COLLATE 'latin1_swedish_ci',
    PRIMARY KEY (`id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `aio_customer_tracking` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(255) NULL DEFAULT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `domain` VARCHAR(255) NULL DEFAULT NULL,
    `web_session_count` INT(11) NOT NULL,
    `last_seen_date` DATETIME NULL DEFAULT NULL,
    `last_contact_date` DATETIME NULL DEFAULT NULL,
    `last_email_open` DATETIME NULL DEFAULT NULL,
    `created_date` DATETIME NOT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id_shop_id` (`shopify_customer_id`, `shop_id`),
    INDEX `last_seen_date` (`last_seen_date`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

Query cases, running and not running:

1. Running: The query below fetches the records by joining all 4 tables. It takes only 0.300 ms.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20

2. Not running: Simply trying to get the count of these rows makes the query hang. I waited 10 minutes and it was still running.

SELECT 
     COUNT(DISTINCT c.shopify_customer_id)   -- what makes #2 different
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20


3. Not running: Here I simply added one ORDER BY clause to query #1 and it hangs. I waited 10 minutes and it was still running. I studied some articles on query optimization and tried indexing, RIGHT JOIN, etc., but it still does not work.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
  ORDER BY `t`.`last_seen_date`    -- what makes #3 different
LIMIT 20

EXPLAIN for queries #1, #2 and #3: [images not available]

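Any suggestions for optimizing the queries or the table structure are welcome.

What I am trying to do:

The tbl_customers table contains the customer info, the tbl_customers_address table contains the customers' addresses (one customer may have multiple addresses), and the aio_customer_tracking table contains the customers' visit records, where last_seen_date is the visit date.

Now, I simply want to fetch and count the customers with their addresses and visiting information. Also, I can order by any one of the columns from these 3 tables. In my example I am ordering by last_seen_date (the default order). I hope this explanation helps to understand what I am trying to do.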

4 Answers:

Answer 0 (score: 7)

In query #1, but not the other two, the optimizer can use

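UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)

to shortcut the query for

GROUP BY c.shopify_customer_id
LIMIT 20

This is because it can stop after 20 entries of the index. The query is still not ultra-fast because the derived table (subquery t) touches about 51K rows.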

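Query #2 may be slow simply because the optimizer failed to notice and remove the redundant DISTINCT. Instead, it may think that it cannot stop after 20.

Query #3 must go entirely through table c to get every shopify_customer_id group. This is because the ORDER BY prevents short-circuiting to reach the LIMIT 20.

The columns in a GROUP BY must include all the non-aggregated columns in the SELECT, except any that are uniquely determined by the GROUP BY columns. Since you have said that a single shopify_customer_id can have multiple addresses, fetching ca.address1 is not proper in combination with GROUP BY shopify_customer_id. Similarly, the subquery seems improper with respect to last_seen_date, last_contact_date.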
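As a hedged illustration of the GROUP BY rule, the derived table could aggregate the two date columns explicitly, so that each shopify_customer_id yields one well-defined row. Choosing MAX() (the latest visit and the latest contact) is an assumption about the intended meaning; the question does not say which row per customer is wanted.

```sql
-- Sketch only: one deterministic row per customer from the tracking table.
-- MAX() is an assumed choice ("latest"); MIN() would give the earliest instead.
SELECT shopify_customer_id,
       MAX(last_seen_date)    AS last_seen_date,
       MAX(last_contact_date) AS last_contact_date
FROM aio_customer_tracking
GROUP BY shopify_customer_id;
```

Written this way, every selected column is either in the GROUP BY or inside an aggregate, so the result does not depend on which row the server happens to pick.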

In aio_customer_tracking, this change (to a "covering" index) may help a little:

INDEX (`shopify_customer_id`)

to

INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)
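A minimal sketch of how the covering index on aio_customer_tracking might be created. The index name idx_customer_seen_contact is hypothetical, and the old single-column index is left in place here rather than dropped.

```sql
-- Sketch only: add a covering index for the derived table in the question.
-- The index name is a placeholder chosen for this example.
ALTER TABLE aio_customer_tracking
    ADD INDEX idx_customer_seen_contact
        (`shopify_customer_id`, `last_seen_date`, `last_contact_date`);

-- Check whether the subquery can now be answered from the index alone;
-- "Using index" in the Extra column would indicate that.
EXPLAIN
SELECT shopify_customer_id, last_seen_date, last_contact_date
FROM aio_customer_tracking
GROUP BY shopify_customer_id;
```

Once the new index is in place, the original INDEX (`shopify_customer_id`) is a left prefix of it and could be dropped in a separate step.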

Dissecting the goal

> Now, I want ... to count the customers

To count the customers, do this, but don't try to combine it with the "fetching":

SELECT COUNT(*) FROM tbl_customers;

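> Now, I simply want to fetch ... the customers ...
>
> tbl_customers - #Rows: 20 million.

Surely you do not want to fetch 20 million rows! I don't want to think about how to attempt that. Please clarify. And I won't accept paginating through that many rows. Perhaps there is a WHERE clause? The WHERE clause is (usually) the most important part of optimization!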

  

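> Now, I simply want to fetch ... the customers with their addresses and visiting information.

Assuming the WHERE filters down to a "few" customers, then JOINing to another table to get "any" address and "any" visiting info may be problematic and/or inefficient. Demanding the "first" or "last" instead of "any" won't be any easier, but might be more meaningful.

May I suggest that your UI first find a few customers and then, if the user wants, go to another page with all the addresses and all the visits. Or can there be hundreds or more visits?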
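The two-page idea could look roughly like the sketch below. The filter shop_id = 123 and the customer id 456 are made-up placeholder values, and the choice of columns is only an example.

```sql
-- Page 1 (sketch): find a small set of customers using a selective filter.
-- shop_id = 123 is a hypothetical filter value.
SELECT c.id, c.shopify_customer_id, c.first_name, c.last_name, c.email
FROM tbl_customers AS c
WHERE c.shop_id = 123
ORDER BY c.id
LIMIT 20;

-- Page 2 (sketch): all addresses for one customer picked by the user.
-- 456 stands in for that customer's shopify_customer_id.
SELECT company, address1, address2, city, province, country, zip
FROM tbl_customers_address
WHERE shopify_customer_id = 456;

-- Page 2 (sketch): all visits for the same customer, newest first.
SELECT last_seen_date, last_contact_date, web_session_count
FROM aio_customer_tracking
WHERE shopify_customer_id = 456
ORDER BY last_seen_date DESC;
```

Each statement filters on a column that already has an index in the posted schema, so none of them needs a GROUP BY over the joined tables.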

  

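> Also, I can order by any one of the columns from these 3 tables. In my example I am ordering by last_seen_date (the default order).

Let's focus on optimizing the WHERE, then tack last_seen_date onto the end of any index.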
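To illustrate "WHERE first, then last_seen_date at the end of the index", here is a hedged sketch. The question never states its WHERE clause, so the filter on shop_id, the value 123, and the index name idx_shop_last_seen are all assumptions made for the example.

```sql
-- Sketch only: the equality-filtered column comes first, the sort column last.
ALTER TABLE aio_customer_tracking
    ADD INDEX idx_shop_last_seen (`shop_id`, `last_seen_date`);

-- With such an index, the rows for one shop can be read already sorted,
-- so the server may stop after 20 rows instead of sorting everything.
SELECT t.shopify_customer_id, t.last_seen_date, t.last_contact_date
FROM aio_customer_tracking AS t
WHERE t.shop_id = 123
ORDER BY t.last_seen_date DESC
LIMIT 20;
```

Note that this returns tracking rows, so a customer with several visits can appear more than once; whether that is acceptable depends on the requirement.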

Answer 1 (score: 4)

If shopify_customer_id is unique in the tbl_customers table, then why does the second query use DISTINCT and GROUP BY on the shopify_customer_id column?

Please get rid of them.
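A sketch of what the count might look like with DISTINCT, GROUP BY and LIMIT removed. It relies on two things visible in the posted schema: shopify_customer_id is UNIQUE and NOT NULL in tbl_customers, and tbl_shop_setting.id is a primary key, so the inner join cannot duplicate customer rows. The LEFT JOINs are omitted because they cannot change the number of distinct customers.

```sql
-- Sketch only: count customers that belong to an existing shop.
SELECT COUNT(*)
FROM tbl_customers AS c
JOIN tbl_shop_setting AS ssh ON c.shop_id = ssh.id;
```

If every customer is guaranteed to reference an existing shop (an assumption, since no foreign key is declared), the join can be dropped as well and a plain count over tbl_customers gives the same number.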

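Answer 2 (score: 1)

You have too many indexes. That can be a real performance killer on inserts, updates and deletes, and occasionally on selects too, depending on optimizer settings.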
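One index in the posted schema is redundant by construction: in tbl_customers, the plain INDEX `shopify_customer_id` covers the same single column as the UNIQUE INDEX `shopify_customer_id_unique`. A minimal sketch of removing it, offered as one possible cleanup rather than a measured fix:

```sql
-- Sketch only: drop the non-unique duplicate; the UNIQUE index on the
-- same column keeps serving lookups and joins on shopify_customer_id.
ALTER TABLE tbl_customers
    DROP INDEX `shopify_customer_id`;

-- Review what remains before removing anything else.
SHOW INDEX FROM tbl_customers;
```

In aio_customer_tracking, the single-column index on shopify_customer_id is likewise a left prefix of `shopify_customer_id_shop_id`, so it is another candidate to review.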

Also, remove the GROUP BY statement.

For query optimization I could say more about the proper use of clustered versus non-clustered indexes, ORDER BY, WHERE, and views. However, I think that if you remove some of the indexes, your queries will speed up. (Also maybe revise your queries to follow stricter SQL standards and be more logical, but that is beyond the scope of this question.)

One more thing: what are you doing with the query results? Are they stored somewhere and accessed for lookups, used in calculations, used for automated reports, displayed through a web database connection, etc.? It makes a difference, because if you only need a report, a backup, or an export to a flat file, there are far more efficient ways to get this data out. There are many different options depending on what you are doing.

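Answer 3 (score: 1)

Query 2 contains a logical error that others have pointed out: count(distinct(c.shopify_customer_id)) will return a single value, so your GROUP BY only complicates the query. It may indeed make MySQL first group by shopify_customer_id and then execute count(distinct(shopify_customer_id)), which may be the reason for the long execution time.

The ORDER BY of query 3 cannot be optimized because you are joining to a subselect that cannot be indexed. The time it takes is simply the time the system needs to sort the result set.

The solution to your problem would be: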

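  1. Change the index shopify_customer_id (shopify_customer_id) of table tbl_customers_address to shopify_customer_id (shopify_customer_id, default) to optimize the following query.

  2. Create a table (result) with the result of query 1, but without the join

    LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id

  3. Alter the result table and add a column for last_seen_date and indexes for last_seen_date and shopify_customer_id.

  4. Create a table (last_Date) for the result of this query:

    SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id

  5. Update the result table with the values from table last_Date.

Now you can run queries against the result table, ordered by last_seen_date, using the index you created.

The whole process should take less time than executing query 2 or query 3.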
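The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the answerer's exact statements: customer_result stands in for the "result" table, the index names are made up, MAX() replaces the non-aggregated columns in step 4, and the GROUP BY of query 1 is dropped on the assumption that each customer has at most one address with default = 1.

```sql
-- Step 1: composite index for the "default address" join (hypothetical name).
ALTER TABLE tbl_customers_address
    ADD INDEX idx_customer_default (`shopify_customer_id`, `default`);

-- Step 2: materialize query 1 without the tracking subquery.
CREATE TABLE customer_result AS
SELECT c.shopify_customer_id, c.first_name, c.last_name, c.email,
       ssh.shop_name,
       ca.company, ca.address1, ca.address2, ca.city, ca.province,
       ca.country, ca.zip, ca.province_code, ca.country_code
FROM tbl_customers AS c
JOIN tbl_shop_setting AS ssh ON c.shop_id = ssh.id
LEFT JOIN tbl_customers_address AS ca
       ON c.shopify_customer_id = ca.shopify_customer_id
      AND ca.`default` = 1;

-- Step 3: add the sort column and the two indexes.
ALTER TABLE customer_result
    ADD COLUMN last_seen_date DATETIME NULL,
    ADD INDEX idx_last_seen (last_seen_date),
    ADD INDEX idx_customer (shopify_customer_id);

-- Step 4: one row per customer from the tracking table (MAX is assumed).
CREATE TABLE last_Date AS
SELECT shopify_customer_id,
       MAX(last_seen_date)    AS last_seen_date,
       MAX(last_contact_date) AS last_contact_date
FROM aio_customer_tracking
GROUP BY shopify_customer_id;

-- Step 5: copy the dates into the result table.
UPDATE customer_result AS r
JOIN last_Date AS d ON d.shopify_customer_id = r.shopify_customer_id
SET r.last_seen_date = d.last_seen_date;

-- The sorted page can then be read through idx_last_seen.
SELECT * FROM customer_result ORDER BY last_seen_date DESC LIMIT 20;
```

The result table is a snapshot, so it would have to be rebuilt or refreshed whenever the underlying tables change; how often is a design decision the answer leaves open.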