We have a table containing website page views, for example:
time | page_id
----------|-----------------------------
1256645862| pageA
1256645889| pageB
1256647199| pageA
1256647198| pageA
1256647300| pageB
1257863235| pageA
1257863236| pageC
Our production table currently has roughly 40K rows. Every day we want to compute the number of unique pages viewed in the preceding 30, 60, and 90 days, so that in the result set we can look up a given day and see how many unique pages were visited in, say, the 60 days ending on that day.
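For illustration, the result set we are after would look something like this (counts worked out by hand from the sample rows above, assuming those timestamps fall on 2009-10-27 and 2009-11-10 UTC):
date       | 30D | 60D | 90D
-----------|-----|-----|-----
2009-10-27 |  2  |  2  |  2
2009-11-10 |  3  |  3  |  3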
We were able to do this in MSSQL with the query:
SELECT DISTINCT
CONVERT(VARCHAR,P.NDATE,101) AS 'DATE',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-29,P.NDATE) AND P.NDATE) AS SUB) AS '30D',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-59,P.NDATE) AND P.NDATE) AS SUB) AS '60D',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-89,P.NDATE) AND P.NDATE) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'
Note: because MSSQL has no FROM_UNIXTIME function, we added an NDATE column for testing, which is simply the converted time. NDATE does not exist in the production table.
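For reference, the question does not show how NDATE was populated; one way it could be done in MSSQL (an assumption, not the OP's actual script) is:
-- Hypothetical MSSQL test column: convert the Unix timestamp to a DATETIME
ALTER TABLE perflog ADD NDATE DATETIME;
UPDATE perflog SET NDATE = DATEADD(SECOND, [time], '1970-01-01');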
Translating this query to MySQL gives us an "Unknown column 'P.time'" error:
SELECT DISTINCT
FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '30D',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '60D',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'
I understand this is because we cannot have a correlated subquery that references a table in the outer FROM clause this way. Unfortunately, we are at a loss as to how to rewrite this query so it works in MySQL. For now we simply return all DISTINCT rows from the table and post-process them in PHP, which takes 2-3 seconds for 40K rows. I'm worried about performance once we have hundreds of thousands of rows.
Can this be done in MySQL? And if so, can we expect it to perform better than our PHP post-processing solution?
UPDATE: Here is the query that creates the table:
CREATE TABLE `perflog` (
`user_id` VARBINARY( 40 ) NOT NULL ,
`elapsed` float UNSIGNED NOT NULL ,
`page_id` VARCHAR( 255 ) NOT NULL ,
`time` INT( 10 ) UNSIGNED NOT NULL ,
`ip` VARBINARY( 40 ) NOT NULL ,
`agent` VARCHAR( 255 ) NOT NULL ,
PRIMARY KEY ( `user_id` , `page_id` , `time` , `ip`, `agent` )
) ENGINE MyISAM
So far, our production table has roughly 40K rows.
Answer 0 (score: 1)
Note: I am writing this after reading the solutions from @astander, @Donnie, and @longneck.
I know performance matters, but why not store the aggregates? That is only 3,650 rows per ten years, with just a few columns per row.
TABLE dimDate (DateKey int (PK), Year int, Day int, DayOfWeek varchar(10), DayInEpoch....)
TABLE AggVisits (DateKey int (PK,FK), Today int, Last30 int, Last60 int, Last90 int)
That way you only run the query once, at the end of each day, and only for that one day. Pre-computed aggregates are at the root of any high-performance analytics solution (cubes).
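A rough sketch of such an end-of-day job, assuming the AggVisits layout above, a yyyymmdd integer DateKey, and the original perflog table with its Unix time column (all assumptions, not something the question specifies):
-- Hypothetical end-of-day job: compute the trailing 30/60/90-day unique page
-- counts for the day being closed out and store them in the aggregate table.
SET @d = CURDATE();
INSERT INTO AggVisits (DateKey, Today, Last30, Last60, Last90)
SELECT
    DATE_FORMAT(@d, '%Y%m%d') AS DateKey,
    COUNT(DISTINCT IF(FROM_UNIXTIME(time) >= @d, page_id, NULL)) AS Today,
    COUNT(DISTINCT IF(FROM_UNIXTIME(time) >= @d - INTERVAL 29 DAY, page_id, NULL)) AS Last30,
    COUNT(DISTINCT IF(FROM_UNIXTIME(time) >= @d - INTERVAL 59 DAY, page_id, NULL)) AS Last60,
    COUNT(DISTINCT page_id) AS Last90
FROM perflog
WHERE FROM_UNIXTIME(time) >= @d - INTERVAL 89 DAY
  AND FROM_UNIXTIME(time) <  @d + INTERVAL 1 DAY;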
Update: You can speed these queries up further by introducing another column, DayInEpoch int (the day number since 1990-01-01). Then you can drop all of those date/time conversion functions.
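As an illustration (DayInEpoch is the column proposed above; TO_DAYS is used here only as one convenient way to get a day number, it counts from year 0 rather than 1990, but any consistent numbering works):
-- Hypothetical: materialize a day number once, then filter on plain integers
-- instead of calling FROM_UNIXTIME/DATE_SUB on every row.
ALTER TABLE perflog ADD COLUMN DayInEpoch INT UNSIGNED;
UPDATE perflog SET DayInEpoch = TO_DAYS(FROM_UNIXTIME(time));
-- "unique pages in the 30 days ending on the reference day" then becomes:
SET @ref_day = TO_DAYS('2009-11-10');
SELECT COUNT(DISTINCT page_id)
FROM perflog
WHERE DayInEpoch BETWEEN @ref_day - 29 AND @ref_day;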
Answer 1 (score: 0)
Why are you burying the subquery a second level deep like that? Try this:
SELECT DISTINCT
FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE',
(SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '30D',
(SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '60D',
(SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '90D'
FROM PERFLOG P
ORDER BY `DATE`
Answer 2 (score: 0)
You could try it with a single SELECT.
Select only the rows whose date falls between the reference date and 90 days before it.
Then use a CASE expression in each field to check whether the date falls within the 30-, 60-, or 90-day window: the field is 1 when the CASE is true and 0 otherwise, and you sum them up.
Something like (pseudocode):
SELECT SUM(CASE WHEN p.Date IN 30 PERIOD THEN 1 ELSE 0 END) Cnt30,
SUM(CASE WHEN p.Date IN 60 PERIOD THEN 1 ELSE 0 END) Cnt60,
SUM(CASE WHEN p.Date IN 90 PERIOD THEN 1 ELSE 0 END) Cnt90
FROM Table
WHERE p.Date IN 90 PERIOD
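Spelled out in concrete MySQL for a single reference day (a sketch only: @ref is a hypothetical variable holding the day of interest, and COUNT(DISTINCT CASE ...) replaces SUM(CASE ...) so that repeated views of the same page are not double-counted):
-- Sketch: trailing 30/60/90-day unique page counts for one reference day.
SET @ref = '2009-11-10';
SELECT
    COUNT(DISTINCT CASE WHEN FROM_UNIXTIME(time) >= @ref - INTERVAL 29 DAY THEN page_id END) AS Cnt30,
    COUNT(DISTINCT CASE WHEN FROM_UNIXTIME(time) >= @ref - INTERVAL 59 DAY THEN page_id END) AS Cnt60,
    COUNT(DISTINCT page_id) AS Cnt90
FROM perflog
WHERE FROM_UNIXTIME(time) >= @ref - INTERVAL 89 DAY
  AND FROM_UNIXTIME(time) <  @ref + INTERVAL 1 DAY;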
Answer 3 (score: 0)
Change the sub-selects to joins, like this:
select
FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE',
count(distinct p30.page_id) AS '30D',
count(distinct p60.page_id) AS '60D',
count(distinct p90.page_id) AS '90D'
from
perflog p
join perflog p30 on FROM_UNIXTIME(p30.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')
join perflog p60 on FROM_UNIXTIME(p60.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')
join perflog p90 on FROM_UNIXTIME(p90.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')
group by FROM_UNIXTIME(p.time,'%Y-%m-%d')
However, this will probably run slowly, because the heavy use of functions prevents any index on the date column from being used. A better solution might be:
create temporary table perf_tmp as
select
FROM_UNIXTIME(time,'%Y-%m-%d') AS 'VIEWDATE',
page_id
from
perflog;
create index perf_dt on perf_tmp (VIEWDATE);
select
p.VIEWDATE,
count(distinct p30.page_id) AS '30D',
count(distinct p60.page_id) AS '60D',
count(distinct p90.page_id) AS '90D'
from
perf_tmp p
join perf_tmp p30 on p30.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 30 DAY) AND p.VIEWDATE
join perf_tmp p60 on p60.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 60 DAY) AND p.VIEWDATE
join perf_tmp p90 on p90.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 90 DAY) AND p.VIEWDATE
group by p.VIEWDATE;
Answer 4 (score: 0)
This is the PHP I'm using to solve the problem. Ideally I would like it all to be done by MySQL (if it can be done faster that way). I'm only posting it here as further clarification of the task:
function getUniqueUsage($field = 'page_id', $since = 90){
    //we need to add 90 days onto our date range for the 90-day sum
    $sinceSeconds = mktime(0, 0, 0) - (($since + 90) * (60 * 60 * 24)); //midnight today, minus the window plus the extra 90 days
    //==> omitting mySQL connection details <==
    $sql = "SELECT DISTINCT From_unixtime(time,'%Y-%m-%d') AS date, $field FROM perflog WHERE time > $sinceSeconds ORDER BY date";
    $sql_results = mysql_query($sql);
    $results = array();
    //all page ids per date (ending up with only unique date keys)
    while ($row = mysql_fetch_assoc($sql_results))
    {
        $results[$row['date']][] = $row[$field];
    }
    $sums = array();
    //initialize sum array, with only unique dates (days)
    foreach (array_keys($results) as $date){
        $sums[$date] = array(0, 0, 0);
    }
    //calculate the 30/60/90 day unique pages for each day
    foreach (array_keys($sums) as $ref_date){
        $merges30 = array();
        $merges60 = array();
        $merges90 = array();
        $ref_time = strtotime($ref_date);
        $ref_minus_30 = strtotime("-30 Days", $ref_time);
        $ref_minus_60 = strtotime("-60 Days", $ref_time);
        $ref_minus_90 = strtotime("-90 Days", $ref_time);
        foreach ($results as $result_date => $pages){
            $compare_time = strtotime($result_date);
            if ($compare_time >= $ref_minus_30 && $compare_time <= $ref_time){
                $merges30 = array_merge($merges30, $pages);
            }
            if ($compare_time >= $ref_minus_60 && $compare_time <= $ref_time){
                $merges60 = array_merge($merges60, $pages);
            }
            if ($compare_time >= $ref_minus_90 && $compare_time <= $ref_time){
                $merges90 = array_merge($merges90, $pages);
            }
        }
        $sums[$ref_date] = array(count(array_unique($merges30)), count(array_unique($merges60)), count(array_unique($merges90)));
    }
    //truncate to only the specified number of days
    return array_slice($sums, -$since, $since, true);
}
As you can see, a lot of unfortunate array merging and array_unique calls.