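I'm new to SQL, and this forum has been my lifeline. Thank you for creating and sharing on this excellent platform.
I'm currently working with a large dataset and would appreciate some guidance.
The table (existing_table) has 4 million rows and looks like this: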
id date sales_a sales_b sales_c sales_d sales_e
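Note that multiple rows can share the same date.
What I want to do is add 5 columns to this table (cumulative_sales_a, cumulative_sales_b, etc.) that hold the cumulative sales figures for a, b, c, etc. up to a given date (grouped by date). I'm using the following code to do this: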
create table new_cumulative
select t.id, t.date, t.sales_a, t.sales_b, t.sales_c, t.sales_d, t.sales_e,
(select sum(x.sales_a) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_a,
(select sum(x.sales_b) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_b,
(select sum(x.sales_c) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_c,
(select sum(x.sales_d) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_d,
(select sum(x.sales_e) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_e
from existing_table t
group by t.id, t.date;
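Before running this query, I created an index on the column 'id'.
Although I got the output I wanted, the query took almost 11 hours to complete.
I'd like to know whether I'm doing something wrong here, and whether there is a better (faster) way to run this kind of query.
Thanks for your help.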
Answer 0: (score: 0)
Some queries are inherently expensive and take a long time to execute. In this particular case, you can avoid using 5 subqueries:
SELECT a.*, SUM(b.daily_sales_a) AS cumulative_sales_a, SUM(b.daily_sales_b) AS cumulative_sales_b, ...
FROM
(
select t.id, t.`date`, t.sales_a, t.sales_b, t.sales_c, t.sales_d, t.sales_e
from existing_table t
GROUP BY t.id,t.`date`
)a
LEFT JOIN
(
-- one row per (id, date) holding that day's totals
select x.id, x.`date`, sum(x.sales_a) as daily_sales_a,
sum(x.sales_b) as daily_sales_b, ...
FROM existing_table x
GROUP BY x.id, x.`date`
)b ON (b.id = a.id AND b.`date` <=a.`date`)
-- add up the daily totals up to and including a.`date`
GROUP BY a.id, a.`date`
This is still an expensive query, but it should get a better execution plan than the original. Also, I'm not sure whether
select t.id, t.`date`, t.sales_a, t.sales_b, t.sales_c, t.sales_d, t.sales_e
from existing_table t
GROUP BY t.id,t.`date`
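gives you what you want: for example, if you have 5 records with the same id and date, it takes the values of the other fields (sales_a, sales_b, etc.) from an arbitrary one of those 5 records...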
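If the goal is one row per id and date carrying that day's totals, one option is to aggregate the sales columns explicitly instead of letting MySQL pick values from an arbitrary row. The sketch below assumes per-day totals are what is wanted; the daily_* aliases are made up for illustration. With ONLY_FULL_GROUP_BY enabled (the default since MySQL 5.7.5), the non-aggregated form is rejected with an error unless (id, date) is a unique key, whereas this form is accepted:

```sql
-- One row per (id, date). Every non-grouped column is aggregated,
-- so the result does not depend on which duplicate row MySQL happens to read.
SELECT t.id,
       t.`date`,
       SUM(t.sales_a) AS daily_sales_a,  -- daily_* aliases are hypothetical
       SUM(t.sales_b) AS daily_sales_b,
       SUM(t.sales_c) AS daily_sales_c,
       SUM(t.sales_d) AS daily_sales_d,
       SUM(t.sales_e) AS daily_sales_e
FROM existing_table t
GROUP BY t.id, t.`date`;
```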
Answer 1: (score: 0)
You can merge all the mini-selects and sums into a single query, turning
(select sum(x.sales_a) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_a,
(select sum(x.sales_b) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_b,
(select sum(x.sales_c) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_c,
(select sum(x.sales_d) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_d,
(select sum(x.sales_e) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_e
into
select sum(x.sales_a), sum(x.sales_b), sum(x.sales_c), sum(x.sales_d), sum(x.sales_e)
from existing_table x
where x.id = t.id and x.date <= t.date
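A scalar subquery in the select list can return only one column, so the five sums cannot be dropped into the original query as a single subquery without some restructuring. One possible way to express the idea, assuming MySQL 8.0.14 or later (where LATERAL derived tables are available) and using the made-up alias c, is sketched here. Unlike the original query, it has no outer GROUP BY, so it returns one output row per source row:

```sql
-- Assumes MySQL 8.0.14+ for LATERAL; the alias c is hypothetical.
-- The derived table is re-evaluated for each row of t and computes
-- all five running totals in a single pass over the matching rows.
SELECT t.id,
       t.`date`,
       t.sales_a, t.sales_b, t.sales_c, t.sales_d, t.sales_e,
       c.cumulative_sales_a,
       c.cumulative_sales_b,
       c.cumulative_sales_c,
       c.cumulative_sales_d,
       c.cumulative_sales_e
FROM existing_table t,
     LATERAL (
         SELECT SUM(x.sales_a) AS cumulative_sales_a,
                SUM(x.sales_b) AS cumulative_sales_b,
                SUM(x.sales_c) AS cumulative_sales_c,
                SUM(x.sales_d) AS cumulative_sales_d,
                SUM(x.sales_e) AS cumulative_sales_e
         FROM existing_table x
         WHERE x.id = t.id
           AND x.`date` <= t.`date`
     ) AS c;
```

This still scans the earlier rows of the same id once per outer row, so it reduces the work from five lookups to one rather than changing how the cost grows with table size.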
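Answer 2: (score: 0)
This looks like a great place for a MySQL user-variable query. In this case, I would pre-query all the aggregates by the intended "ID" and "Date" to remove duplicates and get one entry per day holding that day's total. Take this result and order it by ID and date to prepare it for the next part, which joins it to the "@sqlvariables" version.
Now just process the rows in order and keep accumulating for each ID until a new ID appears, then reset the counters to zero and keep adding the respective "Sales". After each "record" is processed, set @lastID to the ID just processed so it can be compared when the next row is processed, to determine whether you are still on the same person or must force a reset to zero.
To help optimization and make sure the inner "PreAgg" pre-query can use it, make sure there is an index on (ID, date). Should be super fast.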
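The composite index mentioned above could be created along these lines; the index name is made up for illustration, and building it on a 4-million-row table will itself take some time:

```sql
-- Hypothetical index name. The column order (id, then date) matches
-- the GROUP BY and ORDER BY of the PreAgg subquery.
CREATE INDEX idx_existing_table_id_date
    ON existing_table (id, `date`);
```

With such an index in place, the query is: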
SELECT
PreAgg.ID,
PreAgg.`Date`,
PreAgg.SalesA,
PreAgg.SalesB,
PreAgg.SalesC,
PreAgg.SalesD,
PreAgg.SalesE,
-- compare (=) against the previous row's ID; using := here would overwrite @lastID before the check
@CumulativeA := if( @lastID = PreAgg.ID, @CumulativeA, 0 ) + PreAgg.SalesA as CumulativeA,
@CumulativeB := if( @lastID = PreAgg.ID, @CumulativeB, 0 ) + PreAgg.SalesB as CumulativeB,
@CumulativeC := if( @lastID = PreAgg.ID, @CumulativeC, 0 ) + PreAgg.SalesC as CumulativeC,
@CumulativeD := if( @lastID = PreAgg.ID, @CumulativeD, 0 ) + PreAgg.SalesD as CumulativeD,
@CumulativeE := if( @lastID = PreAgg.ID, @CumulativeE, 0 ) + PreAgg.SalesE as CumulativeE,
@lastID := PreAgg.ID as dummyPlaceholder
from
( select
t.id,
t.`date`,
SUM( t.sales_a ) SalesA,
SUM( t.sales_b ) SalesB,
SUM( t.sales_c ) SalesC,
SUM( t.sales_d ) SalesD,
SUM( t.sales_e ) SalesE
from
existing_table t
group by
t.id,
t.`date`
order by
t.id,
t.`date` ) PreAgg,
( select
@lastID := 0,
@CumulativeA := 0,
@CumulativeB := 0,
@CumulativeC := 0,
@CumulativeD := 0,
@CumulativeE := 0 ) sqlvars
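On MySQL 8.0 or later, window functions are another way to express the same running total without user variables. This may be worth considering because assigning user variables inside expressions is deprecated as of MySQL 8.0.13, and the manual describes the evaluation order of expressions involving user variables as undefined. The following is a sketch under that version assumption; the table demo_sales and its rows are made up purely for illustration and stand in for existing_table, and only two of the five sales columns are shown since the rest follow the same pattern:

```sql
-- Made-up sample data; demo_sales stands in for existing_table.
CREATE TABLE demo_sales (
    id      INT  NOT NULL,
    `date`  DATE NOT NULL,
    sales_a INT  NOT NULL,
    sales_b INT  NOT NULL
);

INSERT INTO demo_sales VALUES
    (1, '2020-01-01', 10, 1),
    (1, '2020-01-01',  5, 2),  -- same id and date as the row above
    (1, '2020-01-02',  7, 3),
    (2, '2020-01-01',  4, 4);

-- Step 1 (CTE): collapse duplicates to one row per (id, date).
-- Step 2 (window): running sum per id, ordered by date.
WITH daily AS (
    SELECT id,
           `date`,
           SUM(sales_a) AS sales_a,
           SUM(sales_b) AS sales_b
    FROM demo_sales
    GROUP BY id, `date`
)
SELECT id,
       `date`,
       sales_a,
       sales_b,
       SUM(sales_a) OVER w AS cumulative_sales_a,
       SUM(sales_b) OVER w AS cumulative_sales_b
FROM daily
WINDOW w AS (PARTITION BY id ORDER BY `date` ROWS UNBOUNDED PRECEDING)
ORDER BY id, `date`;
```

For the sample rows, id 1 gets a cumulative_sales_a of 15 on 2020-01-01 and 22 on 2020-01-02, and id 2 gets 4 on 2020-01-01. PARTITION BY id plays the role of the @lastID reset, and ORDER BY `date` inside the window fixes the accumulation order explicitly.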