I am building an IoT system for home appliances.
My data table was created as follows:
mysql> SHOW CREATE TABLE DataM1\G
*************************** 1. row ***************************
Table: DataM1
Create Table: CREATE TABLE `DataM1` (
`sensor_type` text,
`sensor_name` text,
`timestamp` datetime DEFAULT NULL,
`data_type` text,
`massimo` float DEFAULT NULL,
`minimo` float DEFAULT NULL,
KEY `timestamp_id` (`timestamp`) USING BTREE,
KEY `super_index_id` (`timestamp`,`sensor_name`(11),`data_type`(11)) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
and the query is:
SELECT
sensor_type, sensor_name, timestamp, data_type,
MAX(massimo) as massimo, MIN(minimo) as minimo
FROM DataM1
WHERE timestamp >= NOW() - INTERVAL 1 HOUR
GROUP BY timestamp, sensor_type, sensor_name, data_type;
Now, the problem is that once the table reaches 4 million rows (a few days of data), the query takes more than 50 seconds.
Edit: the EXPLAIN output is:
id: 1
select_type: SIMPLE
table: DataM1
partitions: p0,p1,p2,p3,p4,p5,p6
type: range
possible_keys: timestamp_id,super_index_id
key: timestamp_id
key_len: 6
ref: NULL
rows: 1
filtered: 100.00
Extra: Using index condition; Using temporary; Using filesort
Edit: a sample row of the result set is:
*************************** 418037. row ***************************
sensor_type: SEN
sensor_name: SEN_N2
timestamp: 2016-10-16 17:28:48
data_type: flow_rate
massimo: 17533.8
minimo: 17533.5
Edit: I have normalized the timestamp, sensor_type, sensor_name and data_type values, and created a _view to make the data easier to consume:
CREATE VIEW `_view` AS (
select (
select `vtmp`.`timestamp` from `timestamp` `vtmp` where (`vtmp`.`no` = `pm`.`timestamp`)) AS `timestamp`,(
select `vtmp`.`sensor_type` from `sensor_type` `vtmp` where (`vtmp`.`no` = `pm`.`sensor_type`)) AS `sensor_type`,(
select `vtmp`.`sensor_name` from `sensor_name` `vtmp` where (`vtmp`.`no` = `pm`.`sensor_name`)) AS `sensor_name`,(
select `vtmp`.`data_type` from `data_type` `vtmp` where (`vtmp`.`no` = `pm`.`data_type`)) AS `data_type`,
`pm`.`massimo` AS `massimo`,
`pm`.`minimo` AS `minimo`
from `datam1` `pm` order by `pm`.`timestamp` desc);
Is there a way to speed this up with indexing, sharding and/or partitioning? Or would it be better to rethink the schema and split the information into separate tables? If so, could someone suggest best practices for this scenario?
Answer 0 (score: 2)
You can speed up the GROUP BY query by adding a composite index on the columns used for sorting,
matching
GROUP BY timestamp, sensor_type, sensor_name, data_type;
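A hedged sketch of such an index (the index name `grp_idx` is illustrative; the prefix lengths are required because the columns are TEXT, mirroring the (11) already used in super_index_id):

```sql
-- Illustrative index name; (11) prefixes are needed since the columns are TEXT.
ALTER TABLE DataM1
  ADD INDEX grp_idx (`timestamp`, `sensor_type`(11), `sensor_name`(11), `data_type`(11));
```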
Also note the (11) in the index above:
For TEXT columns, MySQL has to limit how much of the column content goes into the index. You can also make the query faster by choosing more suitable data types, e.g. an INT for the sensor type and the data type (you only have a few distinct types, right?) and a VARCHAR(128) for sensor_name.
And yes, changing the data layout will also bring some benefit. Store the sensor information (type + name) in a separate table and relate it to the data table via a sensor_id. That way only a single INT column needs to be sorted (= grouped), which is much better than sorting two TEXT columns.
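A minimal sketch of that layout (the `sensors` table name and the column sizes are my assumptions, not from the answer):

```sql
-- Hypothetical lookup table; names and sizes are assumptions.
CREATE TABLE sensors (
  sensor_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  sensor_type VARCHAR(32)  NOT NULL,
  sensor_name VARCHAR(128) NOT NULL,
  PRIMARY KEY (sensor_id),
  UNIQUE KEY (sensor_type, sensor_name)
) ENGINE=InnoDB;
-- DataM1 would then carry a sensor_id INT column instead of the two TEXT columns.
```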
Answer 1 (score: 2)
- Don't use "prefix" indexing such as sensor_name(11); it rarely helps and sometimes hurts.
- Don't use TEXT; use VARCHAR(...) with realistic limits instead. ENUM is also a reasonable option for columns with few distinct values.
- Have a PRIMARY KEY. If no column (or combination of columns) is unique, use an AUTO_INCREMENT.
- For the GROUP BY, perhaps truncate the timestamp to the hour? For example, CONCAT(LEFT(timestamp, 13), ':xx') yields 2016-10-16 20:xx.
- I see no LIMIT and no ORDER BY. Will that continue to be the case?

These suggestions will help in various ways. Once you have fixed most of them, we can discuss how to use summary tables to get a 10x speedup.
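The hour-truncation idea could be applied to the original query roughly like this (a sketch; the `hr` alias and the widened one-day window are my additions):

```sql
SELECT sensor_type, sensor_name, data_type,
       CONCAT(LEFT(`timestamp`, 13), ':xx') AS hr,  -- e.g. '2016-10-16 20:xx'
       MAX(massimo) AS massimo,
       MIN(minimo)  AS minimo
FROM DataM1
WHERE `timestamp` >= NOW() - INTERVAL 1 DAY
GROUP BY hr, sensor_type, sensor_name, data_type;
```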
Answer 2 (score: 1)
This answer discusses how to build a summary table.
INSERT INTO Summary
(hr, sensor_type, sensor_name, num_readings,
sum_reading, min_reading, max_reading)
SELECT
FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600) - 1)), -- start of prev hour
sensor_type,
sensor_name,
COUNT(*), -- how many readings were taken in the hour.
SUM(??), -- maybe this is not practical, since you seem to have pairs of readings
MAX(massimo),
MIN(minimo)
FROM DataM1
WHERE `timestamp` >= FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600) - 1))
AND `timestamp` < FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600)));
Run the INSERT above once every hour. This assumes you are taking readings roughly every minute; if you take only one reading per hour, summarizing by the hour gains you little.
More discussion: Summary Tables.
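The Summary table's definition is never shown in the answer; a minimal sketch consistent with the INSERT above (all column types are my assumptions) might be:

```sql
-- Hypothetical definition; column types and key choice are assumptions.
CREATE TABLE Summary (
  hr           DATETIME     NOT NULL,  -- start of the summarized hour
  sensor_type  VARCHAR(32)  NOT NULL,
  sensor_name  VARCHAR(128) NOT NULL,
  num_readings INT UNSIGNED NOT NULL,
  sum_reading  FLOAT,
  min_reading  FLOAT,
  max_reading  FLOAT,
  PRIMARY KEY (hr, sensor_type, sensor_name)
) ENGINE=InnoDB;
```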
To be more robust, the summarizing INSERT-SELECT may need to be more elaborate. What if you miss an hour? (And other things that can go wrong.)
Caveat: this summary table is much faster to read than the "Fact" table, but it can only show time ranges made of whole hours. If you need "the last 60 minutes", you have to go to the Fact table.
Another note: bulky, repetitive values such as sensor_name should stay normalized in the Fact table, but you can (and perhaps should) denormalize them when building the Summary table. (I have omitted those steps in this example.)

To get yesterday's data:

SELECT sensor_type, sensor_name, data_type,
       MAX(massimo) as massimo,
       MIN(minimo) as minimo
FROM Summary
WHERE timestamp >= CURRENT_DATE() - INTERVAL 1 DAY
  AND timestamp < CURRENT_DATE()
GROUP BY sensor_type, sensor_name, data_type;

For all of June, change the WHERE to:

WHERE timestamp >= '2016-06-01'
  AND timestamp < '2016-06-01' + INTERVAL 1 MONTH
Note: the simple way to get an average is to average the averages. But the mathematically correct way is to divide the sum of the sums by the sum of the counts. That is why I included num_readings and sum_reading. On the other hand, when averaging weather readings it is customary to take the average for each day, then average over the days. I'll leave it to you to decide which is "right".
Answer 3 (score: -1)
I think for a use case like this, with so much data, the best solution may be a NoSQL database, performing some aggregation before storing the data. You could look at Google BigQuery and Cloud Dataflow.

However, to answer your question: I would pre-compute the aggregates at the minimum granularity your system requires (for example, you could compute the aggregation every 10 minutes), so that queries then run against a small amount of data.
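Staying within MySQL, such periodic pre-aggregation could be scheduled with the event scheduler (a hedged sketch; the event name, the Summary10m table and the 10-minute bucketing are all illustrative, and `event_scheduler` must be ON):

```sql
-- Hypothetical event; assumes a Summary10m table with these columns exists.
CREATE EVENT agg_10min
ON SCHEDULE EVERY 10 MINUTE
DO
  INSERT INTO Summary10m (bucket, sensor_type, sensor_name, massimo, minimo)
  SELECT FROM_UNIXTIME(600 * FLOOR(UNIX_TIMESTAMP(`timestamp`) / 600)),  -- 10-min bucket start
         sensor_type, sensor_name,
         MAX(massimo), MIN(minimo)
  FROM DataM1
  WHERE `timestamp` >= NOW() - INTERVAL 10 MINUTE
  GROUP BY 1, sensor_type, sensor_name;
```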