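Here is the situation:
I have a SaaS application that is a simple RSS feed reader. I think most people know what that is: users subscribe to RSS feeds and then read items from them. Nothing new. One feed can have many subscribers.
I have implemented some statistics for users, but I don't think I chose the right approach, because things are gradually getting slower as the number of users and feeds grows.
This is what I'm doing now:
Every hour, get the total number of articles for each feed: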
SELECT COUNT(*) FROM articles WHERE feed_id=?
Get the previous value to calculate the delta (this is a bit slow):
SELECT value FROM feeds_stats WHERE feed_id=? AND name='total_articles' ORDER BY date DESC LIMIT 1
Insert the new value and the delta:
INSERT INTO feeds_stats (date,feed_id,name,value,delta) VALUES ('".date("Y-m-d H:i:s",$global_timestamp)."','".$feed_id."','total_articles','".$value."','".($value-$old_value)."')
Then, for every user, get his feeds, and for each feed get the number of articles he has read:
SELECT COUNT(*) FROM users_articles ua JOIN articles a ON a.id=ua.article_id WHERE a.feed_id='%s' AND ua.user_id='%s' AND ua.read=1
users_articles is a table that holds the read state of every article for every user.
Then get the previous value again to calculate the delta:
SELECT value FROM users_feeds_stats WHERE user_id='?' AND feed_id='?' AND name='total_reads' ORDER BY date DESC LIMIT 1
And insert the new value + delta:
INSERT INTO users_feeds_stats (date,user_id,feed_id,name,value,delta) VALUES ('".date("Y-m-d H:i:s",$global_timestamp)."','".$user_id."','".$feed_id."','total_reads','".$value."','".($value-$old_value)."')
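When all feeds for a user have been processed, the aggregation part comes:
This is a bit tricky, and I think there should be a lot of room for optimization here. Here is the actual aggregation function in PHP: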
<?php
function aggregate_user_stats($user_id=false,$feed_id=false){
    global $global_timestamp;
    // defined dimensions: per calendar day, per weekday, per hour of day
    $feed_types[0] = array("days_back" => 31, "group_by" => "DATE_FORMAT(date, '%Y-%m-%d')");
    $feed_types[1] = array("days_back" => 31, "group_by" => "WEEKDAY(date)+1");
    $feed_types[2] = array("days_back" => 31, "group_by" => "HOUR(date)");
    // initialise the filter so it is defined when no user is given
    $where = "";
    if($user_id){
        $where = " WHERE id=".$user_id;
    }
    $feed_where = "";
    $getusers = mysql_query("SELECT id FROM users".$where)or die(__LINE__." ".mysql_error());
    while($user = mysql_fetch_assoc($getusers)){
        if($feed_id){
            $feed_where = " AND feed_id=".$feed_id;
        }
        $user_feeds = array();
        $getfeeds = mysql_query("SELECT feed_id FROM subscriptions WHERE user_id='".$user["id"]."' AND active=1".$feed_where)or die(__LINE__." ".mysql_error());
        while($row = mysql_fetch_assoc($getfeeds)){
            foreach($feed_types as $tab => $type){
                // sum the hourly deltas per bucket for the feed and for this user's reads
                $getdata = mysql_query("
                    SELECT ".$type["group_by"]." AS date, name, SUM(delta) AS delta FROM feeds_stats WHERE feed_id = '".$row["feed_id"]."' AND name='total_articles' AND date > DATE_SUB(NOW(), INTERVAL ".$type["days_back"]." DAY) GROUP BY name, ".$type["group_by"]."
                    UNION
                    SELECT ".$type["group_by"]." AS date, name, SUM(delta) AS delta FROM users_feeds_stats WHERE user_id = '".$user["id"]."' AND feed_id = '".$row["feed_id"]."' AND name='total_reads' AND date > DATE_SUB(NOW(), INTERVAL ".$type["days_back"]." DAY) GROUP BY name, ".$type["group_by"]."
                ")or die(__LINE__." ".mysql_error());
                $data = array();
                // use a separate variable here: reusing $row would overwrite the
                // current feed row and break the DELETE/REPLACE below
                while($stat = mysql_fetch_assoc($getdata)){
                    $data[$stat["date"]][$stat["name"]] = $stat["delta"];
                }
                if(count($data)){
                    // rebuild this user's chart rows for the feed and dimension
                    db_start_trx();
                    mysql_query("DELETE FROM stats_feeds_over_time WHERE feed_id='".$row["feed_id"]."' AND user_id='".$user["id"]."' AND tab='".$tab."'")or die(__LINE__." ".mysql_error());
                    foreach($data as $time => $keys){
                        mysql_query("REPLACE INTO stats_feeds_over_time (feed_id,user_id,tab,date,total_articles,total_reads,total_favs) VALUES ('".$row["feed_id"]."','".$user["id"]."','".$tab."','".$time."','".$keys["total_articles"]."','".$keys["total_reads"]."','".$keys["total_favs"]."')")or die(__LINE__." ".mysql_error());
                    }
                    db_commit_trx();
                }
            }
        }
    }
}
Some notes:
Edit: here is the DDL of the tables involved:
CREATE TABLE `articles` (
`id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`feed_id` INTEGER(11) UNSIGNED NOT NULL,
`date` INTEGER(10) UNSIGNED NOT NULL,
`date_updated` INTEGER(11) UNSIGNED NOT NULL,
`title` VARCHAR(1000) COLLATE utf8_general_ci NOT NULL DEFAULT '',
`url` VARCHAR(2000) COLLATE utf8_general_ci NOT NULL DEFAULT '',
`author` VARCHAR(200) COLLATE utf8_general_ci NOT NULL DEFAULT '',
`hash` CHAR(32) COLLATE utf8_general_ci NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `feed_id_hash` (`feed_id`, `hash`),
KEY `date` (`date`),
KEY `url` (`url`(255))
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';
CREATE TABLE `users_articles` (
`id` BIGINT(20) NOT NULL AUTO_INCREMENT,
`user_id` INTEGER(11) UNSIGNED NOT NULL,
`article_id` INTEGER(11) UNSIGNED NOT NULL,
`subscription_id` INTEGER(11) UNSIGNED NOT NULL,
`read` TINYINT(4) UNSIGNED NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `user_id` (`user_id`, `article_id`),
KEY `article_id` (`article_id`),
KEY `subscription_id` (`subscription_id`)
)ENGINE=InnoDB
CHECKSUM=1 AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';
CREATE TABLE `feeds_stats` (
`id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`feed_id` INTEGER(11) UNSIGNED NOT NULL,
`date` DATETIME NOT NULL,
`name` VARCHAR(50) COLLATE utf8_general_ci NOT NULL DEFAULT '',
`value` INTEGER(11) NOT NULL,
`delta` INTEGER(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `name` (`name`),
KEY `feed_id` (`feed_id`),
KEY `date` (`date`)
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';
CREATE TABLE `users_feeds_stats` (
`id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` INTEGER(11) UNSIGNED NOT NULL DEFAULT '0',
`feed_id` INTEGER(11) UNSIGNED NOT NULL,
`date` DATETIME NOT NULL,
`name` VARCHAR(50) COLLATE utf8_general_ci NOT NULL DEFAULT '',
`value` INTEGER(11) NOT NULL,
`delta` INTEGER(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `name` (`name`),
KEY `feed_id` (`feed_id`),
KEY `user_id` (`user_id`),
KEY `date` (`date`)
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';
CREATE TABLE `stats_feeds_over_time` (
`feed_id` INTEGER(11) UNSIGNED NOT NULL,
`user_id` INTEGER(11) NOT NULL,
`tab` INTEGER(11) NOT NULL,
`date` VARCHAR(30) COLLATE utf8_general_ci NOT NULL DEFAULT '',
`total_articles` DOUBLE(9,2) UNSIGNED NOT NULL,
`total_reads` DOUBLE(9,2) UNSIGNED NOT NULL,
`total_favs` DOUBLE(9,2) UNSIGNED NOT NULL,
PRIMARY KEY (`feed_id`, `user_id`, `tab`, `date`)
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';
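At the end of the aggregation function there is a REPLACE into the stats_feeds_over_time table. This table only contains the records that will be displayed on the charts, so the actual charting does not involve any heavy queries.
Finally, here is the resulting chart:
I would be glad if someone could point me in the right direction on where and how to optimize this solution, even if that means abandoning MySQL for the statistics.
I have long experience with RRDTool, but things are different here because of the "time of day" and "day of week" aggregations.
Answer 0 (score: 1)
I don't know how important the queries you want to optimize are compared with other queries that may run against the same set of tables. I will assume that you want to optimize these queries first.
Seeing that all the queries use feed_id as a WHERE predicate, I would try partitioning the articles table on that column: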
CREATE TABLE `articles` (
`id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`feed_id` INTEGER(11) UNSIGNED NOT NULL,
-- etc.
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT=''
PARTITION BY KEY(feed_id)
PARTITIONS 10;
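The number of partitions (10 above) can be adjusted to your needs, but it must be greater than 1 to have any effect. You may want a larger number to make select queries faster. However, this setup will slow down any query that does not filter on feed_id. Note also that MySQL requires every unique key, including the primary key, to contain the partitioning column, so the primary key of articles would have to become (id, feed_id) for this statement to be accepted.
The same process can be applied to the other tables, using whichever column usually serves as the discriminant in your queries.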
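As an illustration of applying the same idea to another table, here is a minimal sketch for feeds_stats. It assumes MySQL with InnoDB and that rebuilding the table is acceptable; the partition count of 10 is simply carried over from the example above, not a tuned value. Because the partitioning column has to be part of every unique key, the primary key is widened first.

```sql
-- Sketch only: both statements rebuild the table, so try them on a copy first.
-- Step 1: make feed_id part of the primary key (required for partitioning on it).
ALTER TABLE feeds_stats
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, feed_id);

-- Step 2: spread the rows over 10 partitions by feed_id.
ALTER TABLE feeds_stats
  PARTITION BY KEY (feed_id)
  PARTITIONS 10;
```

Queries with a WHERE feed_id = ? predicate can then be limited to a single partition, while queries without it have to visit all of them.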
However, since you run the first two queries for all feeds, you could rewrite them as follows:
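-- article count for every feed in one pass
SELECT feed_id, COUNT(feed_id)
FROM articles
GROUP BY feed_id;
-- latest stored value per feed; a plain GROUP BY feed_id ... ORDER BY date
-- would return an arbitrary row per group, so join on MAX(date) instead
SELECT fs.feed_id, fs.value
FROM feeds_stats fs
JOIN (SELECT feed_id, MAX(date) AS max_date
      FROM feeds_stats
      WHERE name='total_articles'
      GROUP BY feed_id) latest
  ON latest.feed_id = fs.feed_id AND latest.max_date = fs.date
WHERE fs.name='total_articles';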
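Both of these retrieve the results for all feeds at once, which saves you from running one query per feed. Using these queries makes partitioning counterproductive, though, so you have to choose between the two approaches.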
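As a rough sketch of where the bulk approach could lead, the hourly snapshot for all feeds might be written as a single INSERT ... SELECT that computes each new total and its delta in one statement. This is an assumption-laden illustration rather than part of the original suggestion: it uses NOW() in place of the question's $global_timestamp, treats a feed with no earlier snapshot as having a previous value of 0, and skips feeds that have no articles at all.

```sql
-- Sketch only: one hourly snapshot row per feed, with its delta.
INSERT INTO feeds_stats (date, feed_id, name, value, delta)
SELECT NOW(),
       c.feed_id,
       'total_articles',
       c.cnt,
       c.cnt - COALESCE(p.value, 0)      -- no earlier snapshot: delta = total
FROM (
    -- current article count per feed
    SELECT feed_id, COUNT(*) AS cnt
    FROM articles
    GROUP BY feed_id
) c
LEFT JOIN (
    -- most recent stored value per feed
    -- (assumes at most one snapshot per feed and timestamp)
    SELECT fs.feed_id, fs.value
    FROM feeds_stats fs
    JOIN (
        SELECT feed_id, MAX(date) AS max_date
        FROM feeds_stats
        WHERE name = 'total_articles'
        GROUP BY feed_id
    ) m ON m.feed_id = fs.feed_id AND m.max_date = fs.date
    WHERE fs.name = 'total_articles'
) p ON p.feed_id = c.feed_id;
```

Whether this beats the per-feed loop depends on table sizes and indexes, so it is worth checking with EXPLAIN before relying on it.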
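The upside of partitioning: any query that filters on a specific value of feed_id (or of whatever other column is used for partitioning) will see a significant boost. The downside is that general queries get slower.
The upside of the second solution is that it has no impact on other queries.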
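The second solution might also be extended to the per-user read counts from the question. The query below is a sketch under the assumption that one pass over users_articles is acceptable; unlike the question's loop, it does not restrict the result to active subscriptions, so that filter would still have to be applied afterwards.

```sql
-- Sketch only: read counts for every (user, feed) pair in one query.
SELECT ua.user_id,
       a.feed_id,
       COUNT(*) AS total_reads
FROM users_articles ua
JOIN articles a ON a.id = ua.article_id
WHERE ua.`read` = 1          -- backticks because READ is a MySQL keyword
GROUP BY ua.user_id, a.feed_id;
```

Pairs with no read articles produce no row, so the caller would need to treat a missing pair as a count of 0.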