Situation:
We have a table, base1, with 6 million rows. It records actual customer purchases: the purchase date and the parameters of each purchase.
CREATE TABLE base1 (
User_id NOT NULL PRIMARY KEY ,
PurchaseDate date,
Parameter1 int,
Parameter2 int,
...
ParameterK int );
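There is another table, base2, with about 90 million rows. It holds essentially the same information, but instead of purchase dates it is broken down by week: every week of a 4-year period for each customer. A customer is still listed for a week even if they bought nothing in that week.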
CREATE TABLE base2 (
Users_id NOT NULL PRIMARY KEY ,
Week_start date ,
Week_end date,
Parameter1 int,
Parameter2 int,
...
ParameterN int );
The task is to run the following query:
-- a = base2; b, wb%% = base1
--create index idx_uid_purch_date on base1(Users_ID,Purchasedate);
SELECT a.Users_id
-- Checking whether the client will make a purchase in next week and the purchase will be bought on condition
,iif(b.Users_id is not null,1,0) as User_will_buy_next_week
,iif(b.Users_id is not null and b.Parameter1 = 1,1,0) as User_will_buy_on_Condition1
-- about 12 similar iif-conditions
,iif(b.Users_id is not null and (b.Parameter1 = 1 and b.Parameter12 = 1),1,0)
as User_will_buy_on_Condition13
-- checking on the fact of purchase in the past month, 2 months ago, 2.5 months, etc.
,iif(wb1m.Users_id is null,0,1) as was_buy_1_month_ago
,iif(wb2m.Users_id is null,0,1) as was_buy_2_month_ago
,iif(wb25m.Users_id is null,0,1) as was_buy_25_month_ago
,iif(wb3m.Users_id is null,0,1) as was_buy_3_month_ago
,iif(wb6m.Users_id is null,0,1) as was_buy_6_month_ago
,iif(wb1y.Users_id is null,0,1) as was_buy_1_year_ago
,a.[Week_start]
,a.[Week_end]
into base3
FROM base2 a
-- Join for User_will_buy
left join base1 b
on a.Users_id =b.Users_id and
cast(b.[PurchaseDate] as date)>=DATEADD(dd,7,cast(a.[Week_end] as date))
and cast(b.[PurchaseDate] as date)<=DATEADD(dd,14,cast(a.[Week_end] as date))
-- Joins for was_buy
left join base1 wb1m
on a.Users_id =wb1m.Users_id
and cast(wb1m.[PurchaseDate] as date)>=DATEADD(dd,-30-4,cast(a.[Week_end] as date))
and cast(wb1m.[PurchaseDate] as date)<=DATEADD(dd,-30+4,cast(a.[Week_end] as date))
/* 4 more similar joins where different values are added in
DATEADD (dd, %%, cast (a. [Week_end] as date))
to check on the fact of purchase for a certain period */
left outer join base1 wb1y
on a.Users_id =wb1y.Users_id and
cast(wb1y.[PurchaseDate] as date)>=DATEADD(dd,-365-4,cast(a.[Week_end] as date))
and cast(wb1y.[PurchaseDate] as date)<=DATEADD(dd,-365+5,cast(a.[Week_end] as date))
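Because of the large number of joins and the fairly large tables, this script runs for about 24 hours, which is far too long.
The execution plan shows that most of the time is spent in the Merge Joins, in reading the rows of base1 and base2, and in inserting the data into the new table base3.
Question: can this query be optimized so that it runs faster?
Perhaps by using a single join instead, or something along those lines.
Please help, I'm not smart enough :(
Thanks everyone for the answers!
UPD: Perhaps a different join type (merge, loop, or hash) would help, but I can't really verify that theory. Maybe someone can tell me whether I'm right or wrong ;)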
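Answer 0 (score: 0)
I assume that the base1 table stores information about the current week's purchases.
If so, you can drop the [PurchaseDate] parameter from the join condition and use the current date as a constant instead. Your DATEADD function is then applied to the current date and is a constant within the join condition: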
left join base1 b
on a.Users_id =b.Users_id and
DATEADD(day,-7,GETDATE())>=a.[Week_end]
and DATEADD(day,-14,GETDATE())<=a.[Week_end]
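For the query above to work correctly, you should restrict b.[PurchaseDate] to the current day.
You can then run another query for yesterday's purchases, with every DATEADD constant in the join conditions adjusted by -1, and so on, up to 7 queries or however many days the base1 table covers.
You could also group by the day of [PurchaseDate], recompute the constants, and do everything in a single query, but I'm not ready to spend the time writing that myself. :)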
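Answer 1 (score: 0)
If you have recurring expressions such as DATEADD(dd,-30-4,cast(a.[Week_end] as date)), one way to make them SARGable is to create an index on the expression itself. SQL Server cannot do this directly; Postgres can (the statement below keeps SQL Server's DATEADD syntax purely for illustration):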
create index ix_base2__34_days_ago on base2(DATEADD(dd,-30-4, cast([Week_end] as date)))
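An expression like the following can then be SARGable, because the database will use the index on DATEADD(dd,-30-4, cast([Week_end] as date)). With an index like the one in the example above, a condition like this will be fast: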
and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))
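Note that although cast looks like a function, casting to date still yields a SARGable expression, because SQL Server has special handling for datetime-to-date casts: an index on a datetime field remains usable even when you search on only part of the field (the date part). This is similar to a partial match with like, such as where lastname LIKE 'Mc%', which is SARGable even though the index covers the whole lastname field. I digress.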
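To make the distinction concrete, here is a minimal sketch. It assumes an index on base1(Users_id, PurchaseDate) like the one commented out in the question, and that Users_id is an int; the variables @week_end and @user_id and their values are made up for illustration. Wrapping the indexed column in a function generally prevents an index seek on it, whereas moving the date arithmetic to the other side of the comparison leaves the column bare:

```sql
-- Hypothetical values, used only to illustrate the two predicate shapes.
DECLARE @week_end date = '2019-06-30';
DECLARE @user_id  int  = 42;   -- assumption: Users_id is an int

-- Not SARGable: PurchaseDate is buried inside DATEADD, so the optimizer
-- generally cannot seek on the PurchaseDate part of the index.
SELECT b.Users_id, b.PurchaseDate
FROM base1 AS b
WHERE b.Users_id = @user_id
  AND DATEADD(day, 34, b.PurchaseDate) >= @week_end;

-- SARGable: the same filter, but the column stands alone on one side
-- and the arithmetic is applied to the variable instead.
SELECT b.Users_id, b.PurchaseDate
FROM base1 AS b
WHERE b.Users_id = @user_id
  AND b.PurchaseDate >= DATEADD(day, -34, @week_end);
```

Both statements return the same rows; comparing their execution plans (seek versus scan) on the real table is the way to confirm the difference.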
To get something close to an index on an expression in SQL Server, you can create a computed column for that expression. For example:
CREATE TABLE base2 (
Users_id NOT NULL PRIMARY KEY ,
Week_start date ,
Week_end date,
Parameter1 int,
Parameter2 int,
Thirty4DaysAgo as DATEADD(dd,-30-4, cast([Week_end] as date))
)
..then create an index on that column:
create index ix_base2_34_days_ago on base2(Thirty4DaysAgo)
Then change your expression to:
and cast(wb1m.[PurchaseDate] as date) >= a.Thirty4DaysAgo
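That is what I suggested earlier: change the old expression to use the computed column. On further searching, though, it appears you can keep your original code, because SQL Server can match an expression to a computed column, and if that column is indexed the expression will be SARGable. Your DBA can therefore optimize things behind the scenes, and the original code benefits without any code changes. So the following can stay as it is and will still be SARGable, provided your DBA creates a computed column for the dateadd(recurring parameters here) expression and puts an index on it: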
and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))
The only downside (compared with Postgres) is that with SQL Server you are still left with a dangling computed column on the table :)
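Since base2 already exists, the computed column would in practice be added to the existing table rather than declared in a new CREATE TABLE. A minimal sketch, assuming the column and index names from the example above and that the statements are run from a tool that understands the GO batch separator (SSMS or sqlcmd):

```sql
-- Add the computed column to the existing table. The expression is written
-- exactly as it appears in the query so the optimizer has a chance to match it.
ALTER TABLE base2
    ADD Thirty4DaysAgo AS DATEADD(dd, -30-4, CAST([Week_end] AS date));
GO  -- batch separator: the new column must exist before the next statement

-- Index the computed column; DATEADD with a constant offset is deterministic,
-- which SQL Server requires for an index on a computed column.
CREATE INDEX ix_base2_34_days_ago
    ON base2 (Thirty4DaysAgo);
GO
```

On a table of about 90 million rows, building the index takes time and space, so this is worth trying on a copy first.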
Good read: https://littlekendra.com/2016/03/01/sql-servers-year-function-and-index-performance/
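Answer 2 (score: 0)
You want all 90 million base2 rows in the result, each with additional information derived from base1. So what the DBMS has to do is a full table scan on base2 and then quickly find the related rows in base1.
A query with EXISTS clauses would look something like this: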
select
b2.users_id,
b2.week_start,
b2.week_end,
case when exists
(
select *
from base1 b1
where b1.users_id = b2.users_id
and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
and dateadd(day, 14, cast(b2.week_end as date))
) then 1 else 0 end as user_will_buy_next_week,
case when exists
(
select *
from base1 b1
where b1.users_id = b2.users_id
and b1.parameter1 = 1
and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
and dateadd(day, 14, cast(b2.week_end as date))
) then 1 else 0 end as user_will_buy_on_condition1,
case when exists
(
select *
from base1 b1
where b1.users_id = b2.users_id
and b1.parameter1 = 1
and b1.parameter2 = 1
and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
and dateadd(day, 14, cast(b2.week_end as date))
) then 1 else 0 end as user_will_buy_on_condition13,
case when exists
(
select *
from base1 b1
where b1.users_id = b2.users_id
and b1.purchasedate between dateadd(day, -30-4, cast(b2.week_end as date))
and dateadd(day, -30+4, cast(b2.week_end as date))
) then 1 else 0 end as was_buy_1_month_ago,
...
from base2 b2;
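It is easy to see that this will take a long time, because every condition has to be checked for each base2 row. That is 90 million times 7 lookups. The only thing we can do is provide an index and hope the query benefits from it.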
create index idx1 on base1 (users_id, purchasedate, parameter1, parameter2);
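We can add more indexes so that the DBMS can choose among them based on selectivity. Later we can check whether they are actually used and drop the ones that are not.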
create index idx2 on base1 (users_id, parameter1, purchasedate);
create index idx3 on base1 (users_id, parameter1, parameter2, purchasedate);
create index idx4 on base1 (users_id, parameter2, parameter1, purchasedate);
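One way to check afterwards which of these indexes were used is SQL Server's index usage DMV. This is a sketch under the assumption that the query has been run at least once since the instance last started, because the counters in sys.dm_db_index_usage_stats are reset on restart:

```sql
-- Seeks, scans and lookups recorded for each index on base1.
-- A row with NULL counters means the index has not been touched
-- since the counters were last reset.
SELECT i.name AS index_name,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('base1');

-- An index that shows no reads can then be dropped, for example:
-- DROP INDEX idx4 ON base1;
```

An index with updates but no seeks, scans, or lookups only adds maintenance cost and is a candidate for removal.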