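My query runs too slowly. I'm not sure what information I should provide to make it easy for you to help me, but I'll take a stab at it and add more when you inevitably ask for something I either didn't think to include or don't know.
I want to identify customers who made their first purchase in 2006 (but matching on only part of their address, to accommodate households and businesses).
My first attempt was: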
select
distinct a.line1 + '|' + substring(a.zip,1,5)
from
registrations r
join customers c on r.custID = c.id
join addresses a on c.addressID = a.id
where year(r.purchaseDate) = 2006
and a.line1 + '|' + substring(a.zip,1,5) not in (
select
distinct a.line1 + '|' + substring(a.zip,1,5)
from
registrations r
join customers c on r.custID = c.id
join addresses a on c.addressID = a.id
where
year(r.purchaseDate) < 2006
)
When that ran too long, I switched to a NOT EXISTS (which I'm less comfortable with, but willing to try), like this:
select
distinct a.line1 + '|' + substring(a.zip,1,5)
from
registrations r
join customers c on r.custID = c.id
join addresses a on c.addressID = a.id
where
year(r.purchaseDate) = 2006
and not exists (
select
1
from
registrations r
join customers c on r.custID = c.id
join addresses ia on c.addressID = ia.id
where
ia.line1 + '|' + substring(ia.zip,1,5) = a.line1 + '|' + substring(a.zip,1,5) and
year(r.purchaseDate) < 2006
)
group by
a.line1 + '|' + substring(a.zip,1,5)
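But that also ran too long. Something like 17 hours without a result is too long. I think the first thing to consider is that my SQL may be wrong or suboptimal, but in case it isn't, I also want to give you enough information to take the environment into account.
So, diagnostics. You may not care, but just in case: it runs on a G6 server with four quad-core processors and 20 GB of RAM; each query is limited to four processors to preserve performance for web server requests; while I ran this query we held off other large imports and reports because of deadlocks, but the web server is customer-facing and cannot be stopped. There are roughly 15 million registrations, 11 million customers, and 8.6 million addresses. I rebuilt all the indexes just to make sure fragmentation wasn't the problem. However, I'm not really sure how to index properly, so I fully accept that this may be an issue: some of these indexes are my own doing, and some came from scripts that one of the MS analysis tools gave me to improve performance. I'm also not sure how to convey the index information to you, so I'll just give the create scripts: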
ALTER TABLE [dbo].[registrations] ADD CONSTRAINT [PK_flatRegistrations_1] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
ALTER TABLE [dbo].[customers] ADD CONSTRAINT [PK_flatCustomers_1] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
ALTER TABLE [dbo].[addresses] ADD CONSTRAINT [PK_addresses] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [addresses] ON [dbo].[addresses]
(
[line1] ASC,
[line2] ASC,
[city] ASC,
[state] ASC,
[zip] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [deliverable] ON [dbo].[addresses]
(
[addressDeliverable] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [_dta_index_addresses_5_1543676547__K9_K1_6] ON [dbo].[addresses]
(
[addressDeliverable] ASC,
[ID] ASC
)
INCLUDE ( [zip]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [_dta_index_addresses_5_1543676547__K1_K9_6] ON [dbo].[addresses]
(
[ID] ASC,
[addressDeliverable] ASC
)
INCLUDE ( [zip]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [_dta_index_addresses_5_1543676547__K1_6] ON [dbo].[addresses]
(
[ID] ASC
)
INCLUDE ( [zip]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
Thanks very much for taking a look!
Answer 0: (score: 1)
My first attempt would be to replace:
year(r.purchaseDate) = 2006
with:
r.purchaseDate BETWEEN '2006-01-01' and '2006-12-31 23:59:59'
and to replace year(r.purchaseDate) < 2006 with r.purchaseDate < '2006-01-01'.
Also make sure there is an index on purchaseDate.
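A minimal sketch of that advice, assuming purchaseDate is a datetime column. The index name is hypothetical, and including custID is an assumption meant to let the join to customers be served from the index. A half-open range is used here in place of BETWEEN so that purchases in the final fraction of a second of 2006 are not missed.

```sql
-- Hypothetical index name; INCLUDE (custID) is an assumption so the
-- join column can be read from the index without a lookup.
CREATE NONCLUSTERED INDEX IX_registrations_purchaseDate
    ON dbo.registrations (purchaseDate)
    INCLUDE (custID);

-- Comparing the bare column to constants lets the optimizer seek on the
-- index; wrapping the column in YEAR() generally prevents that.
SELECT COUNT(*)
FROM dbo.registrations r
WHERE r.purchaseDate >= '20060101'   -- start of 2006, inclusive
  AND r.purchaseDate <  '20070101';  -- start of 2007, exclusive
```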
Next (if you have enough resources to run it):
-- create temporary table to prepare data
CREATE TABLE #addrs (yearr int, pattern varchar(100)) -- depends on a.line1 length
-- calculate all patterns for purchase before 1st Jan 2007
INSERT INTO
#addrs (yearr, pattern)
SELECT
YEAR(r.purchaseDate),
a.line1 + '|' + substring(a.zip,1,5)
from
registrations r
join customers c on r.custID = c.id
join addresses a on c.addressID = a.id
where
r.purchaseDate < '2007-01-01'
-- optionally, but could be useful in query below
CREATE INDEX idx_temp ON #addrs (pattern, yearr)
-- original query rewritten
SELECT
DISTINCT pattern
FROM
#addrs a
WHERE
a.yearr = 2006
and not exists (
select top 1 1
from
#addrs aa
where
aa.pattern = a.pattern
and aa.yearr < 2006
)
The second solution may contain some typos and may not compile on the first try. It's just an idea.
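As a possible variation on the last query, and only as a sketch: because #addrs holds nothing but purchases made before 2007, an address pattern whose earliest year is 2006 has a 2006 purchase and no earlier one, so the NOT EXISTS can be written as a single aggregate. It assumes the #addrs table has been populated as described in this answer. One difference to be aware of: if line1 or zip can be NULL, rows with a NULL pattern are grouped together here, whereas the NOT EXISTS version never matches them against each other.

```sql
-- Assumes #addrs (yearr, pattern) contains only purchases before 2007.
-- A pattern whose minimum year is 2006 was first seen in 2006.
SELECT
    a.pattern
FROM
    #addrs a
GROUP BY
    a.pattern
HAVING
    MIN(a.yearr) = 2006;
```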
Answer 1: (score: 1)
I think the table aliases in your NOT EXISTS subquery are wrong. Try this:
select r.custID,
a.line1 + '|' + substring(a.zip,1,5)
from registrations r
join customers c on r.custID = c.id
join addresses a on c.addressID = a.id
where r.purchaseDate >= '2006-01-01' and r.purchaseDate < '2007-01-01'
and not exists (
select 1
from registrations ir
join customers ic on ir.custID = ic.id
join addresses ia on ic.addressID = ia.id
where ia.line1 = a.line1
and substring(ia.zip,1,5) = substring(a.zip,1,5)
and ir.purchaseDate < '2006-01-01'
)
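Answer 2: (score: 0)
SubString(A.zip,1,5) must be causing a table scan. Is this a one-off query? If so, take the results of the following query and store them in a new table. Create an index on AddressToCompare and PurchaseDate, and run subsequent queries against the new table.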
Select
R.ID
, R.CustID
, C.AddressID
, A.line1 + '|' + SubString(A.zip, 1, 5) As AddressToCompare
, R.PurchaseDate
From
Registrations R
Inner Join Customers C On R.CustID = C.ID
Inner Join addresses A On C.AddressID = A.ID
Where
R.PurchaseDate < '2007-01-01'
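The whole procedure might look roughly like the sketch below. The table name RegistrationAddresses and the index name are hypothetical, and the index assumes the combined address string fits within SQL Server's index key size limit, which depends on how line1 is declared.

```sql
-- Step 1: materialize the joined rows once (hypothetical table name).
SELECT
    R.ID,
    R.CustID,
    C.AddressID,
    A.line1 + '|' + SUBSTRING(A.zip, 1, 5) AS AddressToCompare,
    R.PurchaseDate
INTO dbo.RegistrationAddresses
FROM dbo.registrations R
INNER JOIN dbo.customers C ON R.CustID = C.ID
INNER JOIN dbo.addresses A ON C.AddressID = A.ID
WHERE R.PurchaseDate < '20070101';

-- Step 2: index the precomputed address string together with the date.
CREATE NONCLUSTERED INDEX IX_RegistrationAddresses_Address_Date
    ON dbo.RegistrationAddresses (AddressToCompare, PurchaseDate);

-- Step 3: addresses with a 2006 purchase and no purchase before 2006.
SELECT DISTINCT cur.AddressToCompare
FROM dbo.RegistrationAddresses cur
WHERE cur.PurchaseDate >= '20060101'
  AND NOT EXISTS (
        SELECT 1
        FROM dbo.RegistrationAddresses prev
        WHERE prev.AddressToCompare = cur.AddressToCompare
          AND prev.PurchaseDate < '20060101'
  );
```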
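Answer 3: (score: 0)
First, your logic is flawed: businesses and customers move in and out of addresses, so comparing addresses rather than customers is a guarantee of wrong results. Just because company ABC ordered something in 2002 doesn't mean company DEF wasn't a first-time customer in 2006, since company ABC and company DEF have nothing to do with each other. If you need to relate people who belong to the same company or household, then have a table that stores those relationships properly; don't rely on an incorrect hack.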
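Assuming you can't do that and this is a process that will be run many times, then you need to persist a column in the addresses table holding
line1 + '|' + substring(zip,1,5)
This saves you from having to compute it on the fly.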
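One way to do this in SQL Server is a persisted computed column with an index on it. This is only a sketch: the column name addressKey and the index name are hypothetical, and the index will only be accepted if the combined string fits within the index key size limit (for example, it will not work if line1 is declared as varchar(max)).

```sql
-- Hypothetical column name. PERSISTED stores the computed value in the
-- row, so it is not recalculated every time a query reads it.
ALTER TABLE dbo.addresses
    ADD addressKey AS (line1 + '|' + SUBSTRING(zip, 1, 5)) PERSISTED;

-- Hypothetical index name. Queries can then compare addressKey values
-- directly instead of concatenating line1 and zip per row.
CREATE NONCLUSTERED INDEX IX_addresses_addressKey
    ON dbo.addresses (addressKey);
```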