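The problem is that I need to maintain a large table of several hundred million rows, and I need to query the database by year and month. Would I get better performance by creating a new column that holds only the year and month (e.g. 1906, unsigned smallint) instead of creating the index directly on the timestamp/datetime column (second precision, e.g. "2019-06-03 11:22")?
Would it also reduce the index size?
Answer 0: (score: 1)
I generated 14 million rows and tested them with the process below. I am not sure how to interpret the results, but here they are.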
OS: Ubuntu 18.04 (virtual machine)
MySQL: 5.7
Time taken to execute each query
index          | data type | sample data         | max   | min   | avg
int3           | int(3)    | 20170902            | 0.248 | 0.169 | 0.1946
int10          | int(10)   | 201709              | 0.248 | 0.183 | 0.2016
smallint       | smallint  | 1709                | 0.306 | 0.182 | 0.2114
int4           | int(4)    | 201709              | 0.325 | 0.175 | 0.2138
date           | date      | 2017-09-02          | 0.397 | 0.242 | 0.2772
datetime_date  | datetime  | 2017-09-02 00:00:00 | 0.422 | 0.278 | 0.3108
datetime       | datetime  | 2017-09-02 05:00:01 | 0.437 | 0.279 | 0.3142
timestamp      | timestamp | 2017-09-02 05:00:01 | 0.96  | 0.79  | 0.8306
timestamp_date | timestamp | 2017-09-02 00:00:00 | 0.978 | 0.792 | 0.8392
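The answer does not say how the max/min/avg figures were collected. As a rough sketch of one way to gather such numbers, the helper below runs a callable several times and reports the slowest, fastest, and mean wall-clock time. The name `time_query`, the default of 5 repeats, and the stand-in workload are all assumptions for illustration; in a real test the callable would execute one of the SQL statements against MySQL.

```python
import time
from statistics import mean

def time_query(run_query, repeats=5):
    """Call run_query() `repeats` times; return (max, min, avg) in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return max(durations), min(durations), mean(durations)

# Stand-in workload (hypothetical); replace with a function that runs the SQL.
slowest, fastest, average = time_query(lambda: sum(range(100_000)))
print(f"max={slowest:.4f} min={fastest:.4f} avg={average:.4f}")
```

Repeating each query matters because the first run often pays for loading index pages into memory, which is one plausible reason the max and min columns differ so much.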
DROP TABLE IF EXISTS `datetime_index_test`;
CREATE TABLE `datetime_index_test` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`datetime` datetime NULL,
`datetime_date` datetime NULL,
`timestamp` timestamp NULL,
`timestamp_date` timestamp NULL,
`smallint` smallint unsigned NULL,
`int10` int(10) unsigned NULL,
`int4` int(4) unsigned NULL,
`int3` int(3) unsigned NULL,
`date` date NULL,
PRIMARY KEY (`id`),
KEY `idx_datetime` (`datetime`),
KEY `idx_datetime_date` (`datetime_date`),
KEY `idx_timestamp` (`timestamp`),
KEY `idx_timestamp_date` (`timestamp_date`),
KEY `idx_smallint` (`smallint`),
KEY `idx_int10` (`int10`),
KEY `idx_int4` (`int4`),
KEY `idx_int3` (`int3`),
KEY `idx_date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
datetime            | timestamp           | smallint | int10  | int4   | int3     | date       | datetime_date | timestamp_date
2017-09-01 00:17:50 | 2017-09-01 00:17:50 | 1709     | 201709 | 201709 | 20170901 | 2017-09-01 | 2017-09-01    | 2017-09-01
2017-09-01 01:03:53 | 2017-09-01 01:03:53 | 1709     | 201709 | 201709 | 20170901 | 2017-09-01 | 2017-09-01    | 2017-09-01
2017-09-01 02:29:56 | 2017-09-01 02:29:56 | 1709     | 201709 | 201709 | 20170901 | 2017-09-01 | 2017-09-01    | 2017-09-01
2017-09-01 03:15:05 | 2017-09-01 03:15:05 | 1709     | 201709 | 201709 | 20170901 | 2017-09-01 | 2017-09-01    | 2017-09-01
2017-09-01 04:22:50 | 2017-09-01 04:22:50 | 1709     | 201709 | 201709 | 20170901 | 2017-09-01 | 2017-09-01    | 2017-09-01
2017-09-01 05:07:05 | 2017-09-01 05:07:05 | 1709     | 201709 | 201709 | 20170901 | 2017-09-01 | 2017-09-01    | 2017-09-01
2017-09-01 06:41:12 | 2017-09-01 06:41:12 | 1709     | 201709 | 201709 | 20170901 | 2017-09-01 | 2017-09-01    | 2017-09-01
Index: int3
SQL: SELECT COUNT(*) FROM `datetime_index_test` WHERE `int3`>=20180601 AND `int3`<20180701;
Index: int10
SQL: select count(*) from `datetime_index_test` where `int10`>=201806 and `int10`<201807;
Index: smallint
SQL: SELECT COUNT(*) FROM `datetime_index_test` WHERE `smallint`>=1806 AND `smallint`<1807;
Index: int4
SQL: SELECT COUNT(*) FROM `datetime_index_test` WHERE `int4`>=201806 AND `int4`<201807;
Index: date
SQL: SELECT COUNT(*) FROM `datetime_index_test` WHERE `date`>="2018-06-01 00:00" AND `date`<"2018-07-01 00:00";
Index: datetime_date
SQL: SELECT COUNT(*) FROM `datetime_index_test` WHERE `datetime_date`>="2018-06-01 00:00" AND `datetime_date`<"2018-07-01 00:00";
Index: datetime
SQL: SELECT COUNT(*) FROM `datetime_index_test` WHERE `datetime`>="2018-06-01 00:00" AND `datetime`<"2018-07-01 00:00";
Index: timestamp
SQL: SELECT COUNT(*) FROM `datetime_index_test` WHERE `timestamp`>="2018-06-01 00:00" AND `timestamp`<"2018-07-01 00:00";
Index: timestamp_date
SQL: SELECT COUNT(*) FROM `datetime_index_test` WHERE `timestamp_date`>="2018-06-01 00:00" AND `timestamp_date`<"2018-07-01 00:00";
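All of the queries above use a half-open range: from the first value of the month up to, but not including, the first value of the next month. As a minimal sketch, the bounds for each encoding could be computed as follows. The function names `next_month` and `month_bounds` are made up for this example, and the smallint form assumes years from 2000 onward, as in the test data.

```python
from datetime import date

def next_month(year, month):
    """Return (year, month) of the following calendar month."""
    return (year + 1, 1) if month == 12 else (year, month + 1)

def month_bounds(year, month):
    """Half-open [lower, upper) bounds for one month in each encoding."""
    ny, nm = next_month(year, month)
    return {
        # yymm, e.g. 1806
        "smallint": ((year - 2000) * 100 + month, (ny - 2000) * 100 + nm),
        # yyyymm, e.g. 201806 (used by the int10 and int4 columns)
        "yyyymm": (year * 100 + month, ny * 100 + nm),
        # yyyymmdd, e.g. 20180601 (used by the int3 column)
        "yyyymmdd": (year * 10000 + month * 100 + 1, ny * 10000 + nm * 100 + 1),
        # ISO dates for the date / datetime / timestamp columns
        "date": (date(year, month, 1).isoformat(), date(ny, nm, 1).isoformat()),
    }

print(month_bounds(2018, 6))
print(month_bounds(2018, 12)["smallint"])
```

Computing the upper bound from the next calendar month keeps December consistent with the other months, since the bound rolls over to January of the following year.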
import pandas as pd
import numpy as np

# One row per hour from 2017-09-01 to 2019-05-01, repeated 1000 times (roughly 14 million rows).
df = pd.date_range(start="2017-09-01 00:00", end="2019-05-01 00:00", freq='h').rename('datetime').to_frame().reset_index(drop=True)
df = pd.concat([df]*1000, axis=0)

# Add a random offset of 0 to 3599 seconds so the rows are spread within each hour.
arr = np.random.randint(low=0, high=3600, size=(len(df)))
arr = arr*np.timedelta64(1, 's')
df['datetime'] = df['datetime']+ arr
df = df.sort_values(['datetime'])
df = df.reset_index(drop=True)
df['timestamp'] = df['datetime']

# smallint: yymm, e.g. 1709 for September 2017.
df['smallint'] = df['timestamp'].dt.year-2000
df['smallint'] = df['smallint']*100
df['smallint'] = df['timestamp'].dt.month + df['smallint']

# int10 / int4: yyyymm, e.g. 201709. int3: yyyymmdd, e.g. 20170901.
df['int10'] = df['smallint']+ 200000
df['int4'] = df['int10']
df['int3'] = df['int4']*100 + df['datetime'].dt.day

# Date-only values, also stored in the datetime_date and timestamp_date columns.
df['date'] = df['timestamp'].dt.date
df['datetime_date'] = df['date']
df['timestamp_date'] = df['date']
Answer 1: (score: 0)
Yes and yes. The smaller the data in the column, the faster the index will be.
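To put rough numbers on the size question, here is a back-of-the-envelope sketch. The per-value byte counts are MySQL's documented storage sizes for these types (5.6.4 and later, without fractional seconds). Note that int(3), int(4) and int(10) are all the same 4-byte INT, because the display width does not change storage. The estimate assumes an InnoDB secondary index, where each entry also carries the primary key (a 4-byte INT UNSIGNED in the test table), and it ignores record and page overhead, so treat the results as lower bounds rather than real index sizes. The name `raw_index_bytes` is hypothetical.

```python
# Storage bytes per value (MySQL 5.6.4+, no fractional seconds).
KEY_BYTES = {
    "smallint": 2,
    "date": 3,
    "int": 4,        # int(3), int(4) and int(10) are all 4-byte INT
    "timestamp": 4,
    "datetime": 5,
}
PK_BYTES = 4  # `id` is INT UNSIGNED; InnoDB appends it to each secondary index entry

def raw_index_bytes(column_type, rows):
    """Lower bound: (key bytes + primary key bytes) * rows, ignoring overhead."""
    return (KEY_BYTES[column_type] + PK_BYTES) * rows

for column_type in KEY_BYTES:
    print(column_type, raw_index_bytes(column_type, 14_000_000))
```

By this estimate a smallint index is about two thirds the size of a datetime index. Key size alone does not explain the timings in Answer 0, though: timestamp is also 4 bytes, yet it was the slowest column in that test.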
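Answer 2: (score: 0)
... a data warehouse that updates its data every night ... and most of the data is used for statistics by date or by week/month
In that case, you are asking the wrong question. The real question is how to get daily/weekly/monthly statistics out of a data warehouse efficiently. The answer is to build and maintain summary tables.
Since you load one day's worth of new data every night (if I understand your statement correctly), that is an excellent moment to summarize the day's data and add rows to a summary table. Such a table might have only one tenth as many rows, and it can be indexed in several ways. Statistics can then be produced by summing the daily subtotals, which makes any week, month, or arbitrary date range very efficient to fetch.
Such a table would have a DATE column. According to Chen's test, that is not the fastest choice, but it is easier to work with than some form of int. More importantly, it probably accounts for only a small fraction of the total time. And because the summary table will be much smaller, a byte or two (in the size of the date column) will not matter compared with the total disk space consumed.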
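The summary-table idea can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names `events` and `daily_summary`, the column names, and the helper functions are hypothetical, SQLite stands in for MySQL, and dates are stored as ISO text so that string comparison matches date order.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL)")
conn.execute("CREATE TABLE daily_summary (day TEXT PRIMARY KEY, row_count INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO events (created_at) VALUES (?)",
    [("2018-06-01 00:17:50",), ("2018-06-01 13:03:53",),
     ("2018-06-30 23:59:59",), ("2018-07-01 00:00:00",)],
)

def summarize_day(conn, day):
    """Nightly step: count one day's rows and store the subtotal."""
    next_day = (date.fromisoformat(day) + timedelta(days=1)).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO daily_summary (day, row_count) "
        "SELECT ?, COUNT(*) FROM events WHERE created_at >= ? AND created_at < ?",
        (day, day, next_day),
    )

def range_total(conn, start_day, end_day):
    """Sum the daily subtotals over the half-open range [start_day, end_day)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(row_count), 0) FROM daily_summary "
        "WHERE day >= ? AND day < ?",
        (start_day, end_day),
    ).fetchone()
    return row[0]

for day in ("2018-06-01", "2018-06-30", "2018-07-01"):
    summarize_day(conn, day)

print(range_total(conn, "2018-06-01", "2018-07-01"))  # total for June 2018
```

The monthly query reads at most one summary row per day instead of scanning every detail row, which is why the choice between DATE and an int encoding matters much less at this level.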