每次从memsql中的80M记录中选择5秒

时间:2016-08-02 15:59:18

标签: memsql columnstore

我们正在使用memsql 5.1进行网络分析项目。每天大约有80M记录和0.5M记录。一个简单的请求大约需要5秒钟 - 每个域收到多少数据,geo,lang为给定的一天。我觉得可以减少那些时间,但我找不到办法。请告诉我。

表格就像一个

    CREATE TABLE `domains` (
    `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    `geo` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
    `lang` char(5) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
    `browser` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
    `os` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
    `device` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
    `domain` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
    `ref` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
    `blk_cnt` int(11) DEFAULT NULL,
    KEY `date` (`date`,`geo`,`lang`,`domain`) /*!90619 USING CLUSTERED COLUMNSTORE */
    /*!90618 , SHARD KEY () */ 
    )

请求如下:

    memsql> explain  SELECT  domain, geo, lang, avg(blk_cnt) as blk_cnt, count(*) as cnt FROM domains WHERE date BETWEEN '2016-07-31 0:00' AND '2016-08-01 0:00'  GROUP BY domain, geo, lang ORDER BY  blk_cnt ASC limit 40;
    +-------------------------------------------------------------------------------------------------------------------------------------------------------+
    | EXPLAIN                                                                                                                                                       |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Project [r0.domain, r0.geo, r0.lang, $0 / CAST(COALESCE($1,0) AS SIGNED) AS blk_cnt, CAST(COALESCE($2,0) AS SIGNED) AS cnt]                                   |
    | Top limit:40                                                                                                                                                  |
    | GatherMerge [SUM(r0.s) / CAST(COALESCE(SUM(r0.c),0) AS SIGNED)] partitions:all est_rows:40                                                                    |
    | Project [r0.domain, r0.geo, r0.lang, s / CAST(COALESCE(c,0) AS SIGNED) AS blk_cnt, CAST(COALESCE(cnt_1,0) AS SIGNED) AS cnt, s, c, cnt_1] est_rows:40 |
    | TopSort limit:40 [SUM(r0.s) / CAST(COALESCE(SUM(r0.c),0) AS SIGNED)]                                                                                          |
    | HashGroupBy [SUM(r0.s) AS s, SUM(r0.c) AS c, SUM(r0.cnt) AS cnt_1] groups:[r0.domain, r0.geo, r0.lang]                                                        |
    | TableScan r0 storage:list stream:no                                                                                                                           |
    | Repartition [domains.domain, domains.geo, domains.lang, cnt, s, c] AS r0 shard_key:[domain, geo, lang] est_rows:40 est_select_cost:144350216          |
    | HashGroupBy [COUNT(*) AS cnt, SUM(domains.blk_cnt) AS s, COUNT(domains.blk_cnt) AS c] groups:[domains.domain, domains.geo, domains.lang]              |
    | Filter [domains.date >= '2016-07-31 0:00' AND domains.date <= '2016-08-01 0:00']                                                                              |
    | ColumnStoreScan scan_js_data.domains, KEY date (date, geo, lang, domain) USING CLUSTERED COLUMNSTORE est_table_rows:72175108 est_filtered:18043777    |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------+

在应用建议

之后
  • 原始查询的时间 - 5s
  • 带时间戳优化--3.7s
  • with timestamp + shardkey - 2.6s

非常感谢你!

1 个答案:

答案 0 :(得分:0)

执行group by可能是此查询中最昂贵的部分。使用与组匹配的分片键,即SHARD KEY (domain, geo, lang),将允许更快地执行分组。