Question

Situation:

We have a table "base1" with about 6 million rows. It holds the actual customer purchases: the purchase date and the parameters of each purchase.

CREATE TABLE base1 (
Users_id int NOT NULL PRIMARY KEY, -- named User_id and untyped in the post; int is assumed
PurchaseDate date,
Parameter1 int,
Parameter2 int,
-- ...
ParameterK int );

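There is a second table, "base2", with about 90 million rows. It describes the same thing, but instead of purchase dates it is broken down by week: every week of a 4-year period for every customer, so a customer still appears in a week in which they bought nothing.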

CREATE TABLE base2 (
Users_id int NOT NULL PRIMARY KEY, -- type omitted in the post; int is assumed
Week_start date ,
Week_end date,
Parameter1 int,
Parameter2 int,
-- ...
ParameterN int );

The task is to run the following query:

-- a = base2; b and wb* = base1
--create index idx_uid_purch_date on base1(Users_ID,Purchasedate);
SELECT a.Users_id
-- Checking whether the client will make a purchase next week, and whether that purchase meets a given condition
,iif(b.Users_id is not null,1,0) as User_will_buy_next_week
,iif(b.Users_id is not null and b.Parameter1 = 1,1,0) as User_will_buy_on_Condition1
--   about 12 similar iif-conditions
,iif(b.Users_id is not null and (b.Parameter1 = 1 and b.Parameter12 = 1),1,0) 
as User_will_buy_on_Condition13

-- checking on the fact of purchase in the past month, 2 months ago, 2.5 months, etc.
,iif(wb1m.Users_id is null,0,1) as was_buy_1_month_ago
,iif(wb2m.Users_id is null,0,1) as was_buy_2_month_ago
,iif(wb25m.Users_id is null,0,1) as was_buy_25_month_ago
,iif(wb3m.Users_id is null,0,1) as was_buy_3_month_ago
,iif(wb6m.Users_id is null,0,1) as was_buy_6_month_ago
,iif(wb1y.Users_id is null,0,1) as was_buy_1_year_ago

 ,a.[Week_start]
 ,a.[Week_end]

 into base3
 FROM base2 a 

 -- Join for User_will_buy
 left join base1 b
 on a.Users_id =b.Users_id and 
 cast(b.[PurchaseDate] as date)>=DATEADD(dd,7,cast(a.[Week_end] as date)) 
 and cast(b.[PurchaseDate] as date)<=DATEADD(dd,14,cast(a.[Week_end] as date))

 -- Joins for was_buy
 left join base1  wb1m
 on a.Users_id =wb1m.Users_id 
 and cast(wb1m.[PurchaseDate] as date)>=DATEADD(dd,-30-4,cast(a.[Week_end] as date)) 
 and cast(wb1m.[PurchaseDate] as date)<=DATEADD(dd,-30+4,cast(a.[Week_end] as date))

/* 4 more similar joins where different values are added in 
DATEADD (dd, %%, cast (a. [Week_end] as date))
to check on the fact of purchase for a certain period */

 left outer join base1 wb1y
 on a.Users_id =wb1y.Users_id and 
 cast(wb1y.[PurchaseDate] as date)>=DATEADD(dd,-365-4,cast(a.[Week_end] as date)) 
 and cast(wb1y.[PurchaseDate] as date)<=DATEADD(dd,-365+5,cast(a.[Week_end] as date))

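Because of the many joins and the fairly large tables, the script runs for about 24 hours, which is far too long.

The execution plan shows that most of the time goes to the Merge Join operators, to scanning the rows of base1 and base2, and to inserting the data into the new table base3.

Question: can this query be optimized so that it runs faster?

Perhaps by using a single join instead, or something along those lines.

Please help, I'm not smart enough :(

Thanks, everyone, for your answers!

UPD: Maybe a different join type (merge, loop or hash) would help, but I can't really verify that theory. Maybe someone can tell me whether I'm right or wrong ;)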

Answer 1

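I assume that the base1 table stores information about the current week's purchases.

If so, you can drop [PurchaseDate] from the join conditions and use the current date instead. DATEADD is then applied to the current date, so it is a constant within the join condition: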

left join base1 b
on a.Users_id =b.Users_id and 
DATEADD(day,-7,GETDATE())>=a.[Week_end] 
and DATEADD(day,-14,GETDATE())<=a.[Week_end]

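For the query above to be correct, you have to restrict b.[PurchaseDate] to the current day.

You can then run another query for the purchases made yesterday, with every DATEADD constant in the join conditions shifted by -1,

and so on, for up to 7 queries, or for however long a time span base1 covers.

You could also group the [PurchaseDate] values by day, recompute the constants and do everything in a single query, but I'm not ready to spend the time writing that myself. :)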

Answer 2

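If you have recurring expressions such as DATEADD(dd,-30-4,cast(a.[Week_end] as date)), one way to make them SARGable is to create an index on the expression itself. SQL Server cannot do that directly, but Postgres can: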

-- Postgres expression index; week_end - 34 is the Postgres form of DATEADD(dd,-30-4, cast([Week_end] as date))
create index ix_base2__34_days_ago on base2 ((week_end - 34));

With such an index in place, the database can use the index on DATEADD(dd,-30-4, cast([Week_end] as date)), so a condition like the following becomes SARGable and should be fast:

and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))

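Note that although cast looks like a function, casting to date still yields a SARGable expression, because SQL Server has special handling for datetime-to-date casts: an index on a datetime column remains usable even when you search on only part of the value (the date part). This is similar to a partial match with LIKE: where lastname LIKE 'Mc%' is SARGable even though the index covers the whole lastname column. But I digress.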
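
To make that comparison concrete, here is a minimal sketch of two equivalent ways to filter a datetime column by day. It assumes base1.PurchaseDate is stored as datetime (the question declares it as date, in which case the cast changes nothing), and it uses a hypothetical index ix_base1_purchasedate and an arbitrary date literal that are not part of the question.

```sql
-- Hypothetical index, for this illustration only.
create index ix_base1_purchasedate on base1 (PurchaseDate);

-- Form 1: cast on the column. Per the point above, SQL Server can still
-- seek on the index for this predicate.
select count(*)
from base1
where cast(PurchaseDate as date) = '20160301';

-- Form 2: the same filter written as a half-open range on the bare column.
select count(*)
from base1
where PurchaseDate >= '20160301'
  and PurchaseDate <  '20160302';
```

Comparing the actual execution plans of the two statements (index seek or scan) is a quick way to confirm the behavior on your own SQL Server version.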

To get something close to an index on an expression in SQL Server, you can create a computed column for the expression. For example:

CREATE TABLE base2 (
  Users_id int NOT NULL PRIMARY KEY, -- type omitted in the post; int is assumed
  Week_start date ,
  Week_end date,
  Parameter1 int,
  Parameter2 int,
  Thirty4DaysAgo as DATEADD(dd,-30-4, cast([Week_end] as date))
)

...and then create an index on that column:

create index ix_base2_34_days_ago on base2(Thirty4DaysAgo)

Then change your expression to:

and cast(wb1m.[PurchaseDate] as date) >= a.Thirty4DaysAgo

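That is what I suggested originally: change the old expression to use the computed column. On further searching, however, it seems you can keep the original code, because SQL Server can match an expression to a computed column and, if that column is indexed, the expression becomes SARGable. So your DBA can do the optimization behind the scenes, and the original code runs optimized without any change. The following can therefore stay as it is and will still be SARGable, provided your DBA creates a computed column for the dateadd(recurring parameters here) expression and puts an index on it: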

and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))

The only downside compared with Postgres is that with SQL Server you are left with a dangling computed column on the table :)
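
On an existing table of 90 million rows, recreating base2 is hardly practical, so the computed column would more likely be added in place. A minimal sketch, reusing the column and index names from this answer. PERSISTED is my addition and is optional (it stores the value instead of recomputing it on read), and whether the optimizer really matches the original expression to the column is something to verify in the execution plan rather than assume.

```sql
-- Add the computed column to the existing table instead of recreating it.
-- The expression is written exactly as in the query so it can be matched.
alter table base2
  add Thirty4DaysAgo as dateadd(dd, -30-4, cast([Week_end] as date)) persisted;

-- Index the computed column.
create index ix_base2_34_days_ago on base2 (Thirty4DaysAgo);
```

Each of the other recurring offsets (-30+4, -365-4 and so on) would need its own column and index, which is worth weighing against the extra storage and the slower inserts into base2.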

Good read: https://littlekendra.com/2016/03/01/sql-servers-year-function-and-index-performance/

Answer 3

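You want all 90 million base2 rows in the result, each with additional information derived from base1. So what the DBMS has to do is a full table scan on base2 and then find the related rows in base1 quickly.

With EXISTS clauses the query would look like this: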

select
  b2.users_id,
  b2.week_start,
  b2.week_end,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_next_week,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.parameter1 = 1
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_on_condition1,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.parameter1 = 1
    and b1.parameter2 = 1
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_on_condition13,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.purchasedate between dateadd(day, -30-4, cast(b2.week_end as date))
                            and dateadd(day, -30+4, cast(b2.week_end as date))
  ) then 1 else 0 end as was_buy_1_month_ago,
  ...
from base2 b2;

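It is easy to see that this will take a long time, because every condition has to be checked for every base2 row. That is 90 million rows times 7 lookups each. The only thing we can do is provide indexes and hope the query benefits from them.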

create index idx1 on base1 (users_id, purchasedate, parameter1, parameter2);

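We can add more indexes so that the DBMS can choose among them based on selectivity. Later we can check whether they are actually used and drop the ones that are not.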

create index idx2 on base1 (users_id, parameter1, purchasedate);
create index idx3 on base1 (users_id, parameter1, parameter2, purchasedate);
create index idx4 on base1 (users_id, parameter2, parameter1, purchasedate);
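
For the step of checking whether the indexes are used, SQL Server exposes usage counters in the sys.dm_db_index_usage_stats view. A minimal sketch, assuming the index names above and that the big query has run at least once since the last server restart (the counters are reset on restart):

```sql
-- Usage counters for every index on base1 in the current database.
-- Indexes that have never been touched show NULL counters.
select
  i.name as index_name,
  s.user_seeks,
  s.user_scans,
  s.user_lookups,
  s.user_updates
from sys.indexes i
left join sys.dm_db_index_usage_stats s
  on  s.object_id   = i.object_id
  and s.index_id    = i.index_id
  and s.database_id = db_id()
where i.object_id = object_id('base1');

-- An index with no seeks, scans or lookups after the query has run
-- is a candidate for removal, for example:
-- drop index idx4 on base1;
```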

How to optimize a SQL SELECT query with multiple left joins?

3 answers: