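How to optimize a SQL SELECT query with multiple left joins?

Time: 2019-04-12 14:41:32

Tags: sql sql-server join select query-optimization

Situation:

We have a table "base1" with 6 million rows that records actual customer purchases, the date of each purchase, and the parameters of that purchase.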

CREATE TABLE base1 (
Users_id NOT NULL PRIMARY KEY ,
PurchaseDate date,
Parameter1 int,
Parameter2 int,
...
ParameterK int );

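There is another table, "base2", with about 90 million rows. It describes the same thing, but instead of purchase dates it is broken down by week (for example, every week of a 4-year span for each customer; a customer still appears for a week even if they bought nothing during it).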

CREATE TABLE base2 (
Users_id NOT NULL PRIMARY KEY ,
Week_start date ,
Week_end date,
Parameter1 int,
Parameter2 int,
...
ParameterN int );

The task is to run the following query:

-- a = base2 ; b , wb%% = base1
--create index idx_uid_purch_date on base1(Users_ID,Purchasedate);
SELECT a.Users_id
-- Checking whether the client will make a purchase in next week and the purchase will be bought on condition
,iif(b.Users_id is not null,1,0) as User_will_buy_next_week
,iif(b.Users_id is not null and b.Parameter1 = 1,1,0) as User_will_buy_on_Condition1
--   about 12 similar iif-conditions
,iif(b.Users_id is not null and (b.Parameter1 = 1 and b.Parameter12 = 1),1,0) 
as User_will_buy_on_Condition13

-- checking on the fact of purchase in the past month, 2 months ago, 2.5 months, etc.
,iif(wb1m.Users_id is null,0,1) as was_buy_1_month_ago
,iif(wb2m.Users_id is null,0,1) as was_buy_2_month_ago
,iif(wb25m.Users_id is null,0,1) as was_buy_25_month_ago
,iif(wb3m.Users_id is null,0,1) as was_buy_3_month_ago
,iif(wb6m.Users_id is null,0,1) as was_buy_6_month_ago
,iif(wb1y.Users_id is null,0,1) as was_buy_1_year_ago

 ,a.[Week_start]
 ,a.[Week_end]

 into base3
 FROM base2 a 

 -- Join for User_will_buy
 left join base1 b
 on a.Users_id =b.Users_id and 
 cast(b.[PurchaseDate] as date)>=DATEADD(dd,7,cast(a.[Week_end] as date)) 
 and cast(b.[PurchaseDate] as date)<=DATEADD(dd,14,cast(a.[Week_end] as date))

 -- Joins for was_buy
 left join base1  wb1m
 on a.Users_id =wb1m.Users_id 
 and cast(wb1m.[PurchaseDate] as date)>=DATEADD(dd,-30-4,cast(a.[Week_end] as date)) 
 and cast(wb1m.[PurchaseDate] as date)<=DATEADD(dd,-30+4,cast(a.[Week_end] as date))

/* 4 more similar joins where different values are added in 
DATEADD (dd, %%, cast (a. [Week_end] as date))
to check on the fact of purchase for a certain period */

 left outer join base1 wb1y
 on a.Users_id =wb1y.Users_id and 
 cast(wb1y.[PurchaseDate] as date)>=DATEADD(dd,-365-4,cast(a.[Week_end] as date)) 
 and cast(wb1y.[PurchaseDate] as date)<=DATEADD(dd,-365+5,cast(a.[Week_end] as date))

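Because of the many joins and the fairly large tables, this script runs for about 24 hours, which is far too long.

The execution plan shows that most of the time goes to the "Merge Join" operators, to scanning the rows of base1 and base2, and to inserting the data into the new base3 table.

Question: can this query be optimized so that it runs faster?

Perhaps by using a single join instead, or something like that.

Please help, I'm not smart enough :(

Thanks to everyone for your answers!

UPD: Maybe using a different join type (merge, loop, or hash) could help, but I haven't been able to properly test that theory. Perhaps someone can tell me whether I'm right or wrong ;)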

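3 answers:

Answer 0: (score: 0)

I assume the base1 table stores information about the current week's purchases.

If so, you can drop the [PurchaseDate] parameter from the join condition and use the current date as a constant instead. In that case the DATEADD functions are applied to the current date and are constants within the join condition: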

left join base1 b
on a.Users_id =b.Users_id and 
DATEADD(day,-7,GETDATE())>=a.[Week_end] 
and DATEADD(day,-14,GETDATE())<=a.[Week_end]

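For the query above to work correctly, you should restrict b.[PurchaseDate] to the current day.

Then you can run another query for yesterday's purchases, adjusting every constant in the DATEADD calls in the join condition by -1.

And so on, up to 7 queries, or however long a time span the base1 table covers.

You could also group the [PurchaseDate] values by day, recalculate the constants, and do all of this in a single query, but I'm not prepared to spend the time building that myself. :)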

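Answer 1: (score: 0)

If you have recurring expressions such as DATEADD(dd,-30-4,cast(a.[Week_end] as date)), you can make them SARGable by creating an index on the expression itself (SQL Server cannot do this directly). Postgres can: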

create index ix_base2__34_days_ago on base2(DATEADD(dd,-30-4, cast([Week_end] as date)))

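Because the database will then use the index on DATEADD(dd,-30-4, cast([Week_end] as date)), an expression like the following becomes SARGable, so a condition like this will be fast once the index from the example above exists.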

and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))

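Note that although cast looks like a function, casting to date still yields a SARGable expression, because SQL Server has special handling for datetime-to-date conversion: an index on a datetime column stays SARGable even when you search on only part of the value (the date part). This is similar to a partial match such as where lastname LIKE 'Mc%', which is SARGable even though the index covers the whole lastname column. But I digress.

To get something like an index on an expression in SQL Server, you can create a computed column for that expression. For example: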

CREATE TABLE base2 (
  Users_id NOT NULL PRIMARY KEY ,
  Week_start date ,
  Week_end date,
  Parameter1 int,
  Parameter2 int,
  Thirty4DaysAgo as DATEADD(dd,-30-4, cast([Week_end] as date))
)

...and then create an index on that column:

create index ix_base2_34_days_ago on base2(Thirty4DaysAgo)

Then change your expression to:

and cast(wb1m.[PurchaseDate] as date) >= a.Thirty4DaysAgo

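That is what I suggested earlier: change the old expression to use the computed column. On further searching, though, it seems you can keep the original code, because SQL Server can match an expression to a computed column, and if that column is indexed the expression will be SARGable. So your DBA can optimize things behind the scenes, and the original code can benefit without any code changes. The following therefore needs no change and will be SARGable (provided your DBA creates a computed column for the dateadd(recurring parameters here) expression and puts an index on it):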

and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))

The only downside (compared with Postgres) is that with SQL Server you are left with a dangling computed column on the table :)

Good read: https://littlekendra.com/2016/03/01/sql-servers-year-function-and-index-performance/
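
For a table that already exists, the computed column can be added in place instead of re-creating the table. The following is a minimal sketch, assuming the base2 table and the column and index names used above; marking the column PERSISTED is an optional choice made here, not something the answer requires.

```sql
-- Add the computed column to the existing table.
-- The expression is written exactly as it appears in the query,
-- so the optimizer has a chance to match the two.
ALTER TABLE base2
    ADD Thirty4DaysAgo AS DATEADD(dd, -30-4, CAST([Week_end] AS date)) PERSISTED;

-- Index the computed column.
CREATE INDEX ix_base2_34_days_ago ON base2 (Thirty4DaysAgo);
```

Comparing the execution plan before and after is the way to confirm whether the new index is actually picked up.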

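Answer 2: (score: 0)

You want all 90 million base2 rows in the result, each with additional information derived from base1. So what the DBMS has to do is a full table scan on base2 and quickly find the related rows in base1.

A query with EXISTS clauses would look like this: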

select
  b2.users_id,
  b2.week_start,
  b2.week_end,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_next_week,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.parameter1 = 1
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_on_condition1,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.parameter1 = 1
    and b1.parameter2 = 1
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_on_condition13,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.purchasedate between dateadd(day, -30-4, cast(b2.week_end as date))
                            and dateadd(day, -30+4, cast(b2.week_end as date))
  ) then 1 else 0 end as was_buy_1_month_ago,
  ...
from base2 b2;

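It is easy to see that this will take a long time, because every condition has to be checked for each base2 row. That is 90 million times 7 lookups. The only thing we can do is provide an index and hope the query benefits from it.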

create index idx1 on base1 (users_id, purchasedate, parameter1, parameter2);
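
The same lookups can also be written as a single join with conditional aggregation, which is closer to the "one join" idea raised in the question. This is only a sketch and not a measured improvement: it assumes that week_end and purchasedate are date columns as in the table definitions, that (users_id, week_start, week_end) identifies a base2 row, and it spells out only three of the flags.

```sql
-- Sketch: join base1 once over the widest window any flag needs,
-- then derive each flag with MAX(CASE ...).
SELECT
    b2.users_id,
    b2.week_start,
    b2.week_end,
    -- purchase 7 to 14 days after the week ends
    MAX(CASE WHEN b1.purchasedate BETWEEN DATEADD(day, 7, b2.week_end)
                                      AND DATEADD(day, 14, b2.week_end)
             THEN 1 ELSE 0 END) AS user_will_buy_next_week,
    -- same window, restricted to parameter1 = 1
    MAX(CASE WHEN b1.parameter1 = 1
              AND b1.purchasedate BETWEEN DATEADD(day, 7, b2.week_end)
                                      AND DATEADD(day, 14, b2.week_end)
             THEN 1 ELSE 0 END) AS user_will_buy_on_condition1,
    -- purchase roughly one month before the week ends
    MAX(CASE WHEN b1.purchasedate BETWEEN DATEADD(day, -30-4, b2.week_end)
                                      AND DATEADD(day, -30+4, b2.week_end)
             THEN 1 ELSE 0 END) AS was_buy_1_month_ago
    -- the remaining flags follow the same pattern
FROM base2 AS b2
LEFT JOIN base1 AS b1
       ON b1.users_id = b2.users_id
      -- widest window: about one year back to two weeks ahead
      AND b1.purchasedate >= DATEADD(day, -365-4, b2.week_end)
      AND b1.purchasedate <= DATEADD(day, 14, b2.week_end)
GROUP BY b2.users_id, b2.week_start, b2.week_end;
```

When a base2 row has no matching purchase, the b1 columns are NULL, every CASE falls through to 0, and the flags come out as 0. Whether this form is faster than the EXISTS version depends on the plan the optimizer picks, so it is worth comparing both on a sample of the data.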

We could add more indexes so that the DBMS can choose among them based on selectivity. Later we can check whether they are used and drop the ones that are not.

create index idx2 on base1 (users_id, parameter1, purchasedate);
create index idx3 on base1 (users_id, parameter1, parameter2, purchasedate);
create index idx4 on base1 (users_id, parameter2, parameter1, purchasedate);
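
One way to do that check on SQL Server is the sys.dm_db_index_usage_stats view. A minimal sketch, assuming the index names above and that the query has been run since the last server restart (the counters are reset on restart):

```sql
-- List the indexes on base1 with their usage counters.
-- Rows with NULL counters have not been touched since the last restart.
SELECT
    i.name AS index_name,
    s.user_seeks,
    s.user_scans,
    s.user_lookups,
    s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('base1');

-- Drop an index that turned out to be unused (idx4 is only an example).
DROP INDEX idx4 ON base1;
```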