Given the following dataset, paired with a dates table:
MembershipId | ValidFromDate | ValidToDate
==========================================
0001 | 1997-01-01 | 2006-05-09
0002 | 1997-01-01 | 2017-05-12
0003 | 2005-06-02 | 2009-02-07
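How many Memberships are open on any given date, or across a series of dates?
After I asked this question here, this answer provided the necessary functionality: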
select d.[Date]
,count(m.MembershipID) as MembershipCount
from DIM.[Date] as d
left join Memberships as m
on(d.[Date] between m.ValidFromDateKey and m.ValidToDateKey)
where d.CalendarYear = 2016
group by d.[Date]
order by d.[Date];
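However, a commenter remarked that there are other approaches for when the non-equijoin takes too long.
So, what would equijoin-only logic that replicates the output of the query above look like?
From the answers provided so far, I have come up with the following, which performs best on the hardware I am working with, at 3.2 million Membership records: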
declare @s date = '20160101';
declare @e date = getdate();
with s as
(
select d.[Date] as d
,count(s.MembershipID) as s
from dbo.Dates as d
join dbo.Memberships as s
on d.[Date] = s.ValidFromDateKey
group by d.[Date]
)
,e as
(
select d.[Date] as d
,count(e.MembershipID) as e
from dbo.Dates as d
join dbo.Memberships as e
on d.[Date] = e.ValidToDateKey
group by d.[Date]
),c as
(
select isnull(s.d,e.d) as d
,sum(isnull(s.s,0) - isnull(e.e,0)) over (order by isnull(s.d,e.d)) as c
from s
full join e
on s.d = e.d
)
select d.[Date]
,c.c
from dbo.Dates as d
left join c
on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
;
Following on from this, to split the aggregate into its constituent groups per day, I have the following, which also performs well:
declare @s date = '20160101';
declare @e date = getdate();
with s as
(
select d.[Date] as d
,s.MembershipGrouping as g
,count(s.MembershipID) as s
from dbo.Dates as d
join dbo.Memberships as s
on d.[Date] = s.ValidFromDateKey
group by d.[Date]
,s.MembershipGrouping
)
,e as
(
select d.[Date] as d
,e.MembershipGrouping as g
,count(e.MembershipID) as e
from dbo.Dates as d
join dbo.Memberships as e
on d.[Date] = e.ValidToDateKey
group by d.[Date]
,e.MembershipGrouping
),c as
(
select isnull(s.d,e.d) as d
,isnull(s.g,e.g) as g
,sum(isnull(s.s,0) - isnull(e.e,0)) over (partition by isnull(s.g,e.g) order by isnull(s.d,e.d)) as c
from s
full join e
on s.d = e.d
and s.g = e.g
)
select d.[Date]
,c.g
,c.c
from dbo.Dates as d
left join c
on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
,c.g
;
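Can anyone improve on the above?
Answer 0 (score: 13)
If most of your memberships are valid for more than a few days, look at the answer by Martin Smith. That approach is likely to be faster.
When you take a calendar table (DIM.[Date]) and left join it with Memberships, you may end up scanning the Memberships table for each date of the range. Even if there is an index on (ValidFromDate, ValidToDate), it may not be very useful.
It is easy to turn it around.
Scan the Memberships table only once, and for each membership find the dates that are valid using CROSS APPLY.
Sample data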
DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);
INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');
DECLARE @RangeFrom date = '2006-01-01';
DECLARE @RangeTo date = '2006-12-31';
Query 1
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
@T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >= Memberships.ValidFromDate
AND dbo.Calendar.dt <= Memberships.ValidToDate
AND dbo.Calendar.dt >= @RangeFrom
AND dbo.Calendar.dt <= @RangeTo
) AS CA
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE);
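OPTION(RECOMPILE) is not really needed; I include it in all queries when I compare execution plans, to be sure that I am getting the latest plan while I play with the queries.
When I looked at the plan of this query, I saw that the seek on Calendar.dt used only ValidFromDate and ValidToDate; @RangeFrom and @RangeTo were pushed to the residual predicate. That is not ideal. The optimiser is not smart enough to calculate the maximum of two dates (ValidFromDate and @RangeFrom) and use that date as the starting point of the seek.
It is easy to help the optimiser: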
Query 2
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
@T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >=
CASE WHEN Memberships.ValidFromDate > @RangeFrom
THEN Memberships.ValidFromDate
ELSE @RangeFrom END
AND dbo.Calendar.dt <=
CASE WHEN Memberships.ValidToDate < @RangeTo
THEN Memberships.ValidToDate
ELSE @RangeTo END
) AS CA
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE)
;
In this query the seek is optimal and does not read dates that may be discarded later.
Finally, you may not need to scan the whole Memberships table. We need only those rows where the given range of dates intersects with the valid range of the membership.
Query 3
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
@T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >=
CASE WHEN Memberships.ValidFromDate > @RangeFrom
THEN Memberships.ValidFromDate
ELSE @RangeFrom END
AND dbo.Calendar.dt <=
CASE WHEN Memberships.ValidToDate < @RangeTo
THEN Memberships.ValidToDate
ELSE @RangeTo END
) AS CA
WHERE
Memberships.ValidToDate >= @RangeFrom
AND Memberships.ValidFromDate <= @RangeTo
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE)
;
Two intervals [a1;a2] and [b1;b2] intersect when a2 >= b1 and a1 <= b2.
These queries assume that the Calendar table has an index on dt.
You should try and see which indexes work better for the Memberships table.
For the last query, if the table is rather large, two separate indexes on ValidFromDate and on ValidToDate would most likely be better than one index on (ValidFromDate, ValidToDate).
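To make the logic of Query 3 easier to check by hand, here is a minimal Python sketch of the same idea: scan the memberships once, skip rows whose validity does not overlap the requested range, and expand each remaining row only over the clamped range. The function names intervals_intersect and daily_counts are hypothetical, and the in-memory list stands in for the Memberships table; this models the logic only, not the execution plan.

```python
from collections import Counter
from datetime import date, timedelta

# Sample data from the answer: (MembershipId, ValidFromDate, ValidToDate)
memberships = [
    (1, date(1997, 1, 1), date(2006, 5, 9)),
    (2, date(1997, 1, 1), date(2017, 5, 12)),
    (3, date(2005, 6, 2), date(2009, 2, 7)),
]


def intervals_intersect(a1, a2, b1, b2):
    """Closed intervals [a1;a2] and [b1;b2] intersect when a2 >= b1 and a1 <= b2."""
    return a2 >= b1 and a1 <= b2


def daily_counts(rows, range_from, range_to):
    """Count open memberships per day, reading each membership row once."""
    counts = Counter()
    for _membership_id, valid_from, valid_to in rows:
        # Plays the role of the outer WHERE in Query 3: skip non-overlapping rows.
        if not intervals_intersect(valid_from, valid_to, range_from, range_to):
            continue
        # Plays the role of the CASE expressions: clamp to the requested range.
        start = max(valid_from, range_from)
        end = min(valid_to, range_to)
        day = start
        while day <= end:
            counts[day] += 1
            day += timedelta(days=1)
    return counts


counts = daily_counts(memberships, date(2006, 1, 1), date(2006, 12, 31))
print(counts[date(2006, 5, 9)], counts[date(2006, 5, 10)])  # 3 2
```

As with the CROSS APPLY queries, a day on which no membership is open produces no row at all (the Counter simply has no entry for it).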
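You should try different queries and measure their performance on the real hardware with real data. Performance may depend on the data distribution: how many memberships there are, what their valid dates are, how wide or narrow the given range is, and so on.
I recommend a great tool called SQL Sentry Plan Explorer for analysing and comparing execution plans. It is free. It shows a lot of useful statistics, such as the execution time and number of reads for each query. The screenshots above are from this tool.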
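Answer 1 (score: 6)
Assuming your date dimension contains all dates contained in all membership periods, you can use something like the following.
The join is an equijoin, so it can use a hash join or merge join, not just nested loops (which execute the inner subtree once for each outer row).
Assuming an index on (ValidToDate) include(ValidFromDate), or the reverse, this can use a single seek against Memberships and a single scan of the date dimension. The query below has an elapsed time of less than a second for me to return the results for a year against a table with 3.2 million members and a general active membership of 1.4 million (script).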
DECLARE @StartDate DATE = '2016-01-01',
@EndDate DATE = '2016-12-31';
WITH MD
AS (SELECT Date,
SUM(Adj) AS MemberDelta
FROM Memberships
CROSS APPLY (VALUES ( ValidFromDate, +1),
--Membership count decremented day after the ValidToDate
(DATEADD(DAY, 1, ValidToDate), -1) ) V(Date, Adj)
WHERE
--Members already expired before the time range of interest can be ignored
ValidToDate >= @StartDate
AND
--Members whose membership starts after the time range of interest can be ignored
ValidFromDate <= @EndDate
GROUP BY Date),
MC
AS (SELECT DD.DateKey,
SUM(MemberDelta) OVER (ORDER BY DD.DateKey ROWS UNBOUNDED PRECEDING) AS CountOfNonIgnoredMembers
FROM DIM_DATE DD
LEFT JOIN MD
ON MD.Date = DD.DateKey)
SELECT DateKey,
CountOfNonIgnoredMembers AS MembershipCount
FROM MC
WHERE DateKey BETWEEN @StartDate AND @EndDate
ORDER BY DateKey
Demo (uses an extended period, as the calendar year 2016 is not very interesting with the example data)
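The delta-and-running-total idea in the query above can be checked against a brute-force count in a few lines of Python. This is only a sketch under stated assumptions: the function names counts_by_delta and counts_brute_force are hypothetical, the sample rows are the three memberships from the question, and the 2005 to 2009 range is chosen for illustration because it contains a start and two ends.

```python
from collections import defaultdict
from datetime import date, timedelta

# (ValidFromDate, ValidToDate) for the three sample memberships in the question
memberships = [
    (date(1997, 1, 1), date(2006, 5, 9)),
    (date(1997, 1, 1), date(2017, 5, 12)),
    (date(2005, 6, 2), date(2009, 2, 7)),
]


def counts_by_delta(rows, start, end):
    """+1 on ValidFromDate, -1 on the day after ValidToDate, then a running sum."""
    delta = defaultdict(int)
    for valid_from, valid_to in rows:
        # Memberships that cannot overlap the range are ignored, as in the MD CTE.
        if valid_to >= start and valid_from <= end:
            delta[valid_from] += 1
            delta[valid_to + timedelta(days=1)] -= 1
    # Deltas dated before the range still belong in the running total.
    running = sum(v for d, v in delta.items() if d < start)
    result = {}
    day = start
    while day <= end:
        running += delta.get(day, 0)
        result[day] = running
        day += timedelta(days=1)
    return result


def counts_brute_force(rows, start, end):
    """Reference answer: test every membership against every day."""
    result = {}
    day = start
    while day <= end:
        result[day] = sum(1 for f, t in rows if f <= day <= t)
        day += timedelta(days=1)
    return result


by_delta = counts_by_delta(memberships, date(2005, 1, 1), date(2009, 12, 31))
brute = counts_brute_force(memberships, date(2005, 1, 1), date(2009, 12, 31))
print(by_delta == brute)  # True
```

The brute-force version does work proportional to days times memberships, while the delta version touches each membership once and each day once, which is the point of the approach.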
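Answer 2 (score: 2)
One approach is to first use an INNER JOIN to find the set of matches and COUNT() to project MemberCount GROUPed BY DateKey, then UNION ALL with the same set of dates, projecting a 0 as the member count for each date. The last step is to SUM() the MemberCount of this union and GROUP BY DateKey. As requested, this avoids LEFT JOIN and NOT EXISTS. As another member pointed out, this is not an equijoin, because we need to use a range, but I think it does what you intend.
This serves up 1 year of data with around 100k logical reads. On an ordinary laptop with a spinning disk, from a cold cache, it serves 1 month in under a second (with correct counts).
Here is an example that creates 3.3 million rows of random duration. The query at the bottom returns one month of data.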
--Stay quiet for a moment
SET NOCOUNT ON
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
--Clean up if re-running
DROP TABLE IF EXISTS DIM_DATE
DROP TABLE IF EXISTS FACT_MEMBER
--Date dimension
CREATE TABLE DIM_DATE
(
DateKey DATE NOT NULL
)
--Membership fact
CREATE TABLE FACT_MEMBER
(
MembershipId INT NOT NULL
, ValidFromDateKey DATE NOT NULL
, ValidToDateKey DATE NOT NULL
)
--Populate Date dimension from 2001 through end of 2018
DECLARE @startDate DATE = '2001-01-01'
DECLARE @endDate DATE = '2018-12-31'
;WITH CTE_DATE AS
(
SELECT @startDate AS DateKey
UNION ALL
SELECT
DATEADD(DAY, 1, DateKey)
FROM
CTE_DATE AS D
WHERE
D.DateKey < @endDate
)
INSERT INTO
DIM_DATE
(
DateKey
)
SELECT
D.DateKey
FROM
CTE_DATE AS D
OPTION (MAXRECURSION 32767)
--Populate Membership fact with members having a random membership length from 1 to 36 months
;WITH CTE_DATE AS
(
SELECT @startDate AS DateKey
UNION ALL
SELECT
DATEADD(DAY, 1, DateKey)
FROM
CTE_DATE AS D
WHERE
D.DateKey < @endDate
)
,CTE_MEMBER AS
(
SELECT 1 AS MembershipId
UNION ALL
SELECT MembershipId + 1 FROM CTE_MEMBER WHERE MembershipId < 500
)
,
CTE_MEMBERSHIP
AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY NEWID()) AS MembershipId
, D.DateKey AS ValidFromDateKey
FROM
CTE_DATE AS D
CROSS JOIN CTE_MEMBER AS M
)
INSERT INTO
FACT_MEMBER
(
MembershipId
, ValidFromDateKey
, ValidToDateKey
)
SELECT
M.MembershipId
, M.ValidFromDateKey
, DATEADD(MONTH, FLOOR(RAND(CHECKSUM(NEWID())) * (36-1)+1), M.ValidFromDateKey) AS ValidToDateKey
FROM
CTE_MEMBERSHIP AS M
OPTION (MAXRECURSION 32767)
--Add clustered Primary Key to Date dimension
ALTER TABLE DIM_DATE ADD CONSTRAINT PK_DATE PRIMARY KEY CLUSTERED
(
DateKey ASC
)
--Index
--(Optimize in your spare time)
DROP INDEX IF EXISTS SK_FACT_MEMBER ON FACT_MEMBER
CREATE CLUSTERED INDEX SK_FACT_MEMBER ON FACT_MEMBER
(
ValidFromDateKey ASC
, ValidToDateKey ASC
, MembershipId ASC
)
RETURN
--Start test
--Emit stats
SET STATISTICS IO ON
SET STATISTICS TIME ON
--Establish range of dates
DECLARE
@rangeStartDate DATE = '2010-01-01'
, @rangeEndDate DATE = '2010-01-31'
--UNION the count of members for a specific date range with the "zero" set for the same range, and SUM() the counts
;WITH CTE_MEMBER
AS
(
SELECT
D.DateKey
, COUNT(*) AS MembershipCount
FROM
DIM_DATE AS D
INNER JOIN FACT_MEMBER AS M ON
M.ValidFromDateKey <= @rangeEndDate
AND M.ValidToDateKey >= @rangeStartDate
AND D.DateKey BETWEEN M.ValidFromDateKey AND M.ValidToDateKey
WHERE
D.DateKey BETWEEN @rangeStartDate AND @rangeEndDate
GROUP BY
D.DateKey
UNION ALL
SELECT
D.DateKey
, 0 AS MembershipCount
FROM
DIM_DATE AS D
WHERE
D.DateKey BETWEEN @rangeStartDate AND @rangeEndDate
)
SELECT
M.DateKey
, SUM(M.MembershipCount) AS MembershipCount
FROM
CTE_MEMBER AS M
GROUP BY
M.DateKey
ORDER BY
M.DateKey ASC
OPTION (RECOMPILE, MAXDOP 1)
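For a quick check of the UNION ALL "zero set" pattern without building 3.3 million rows, the sketch below runs the same query shape against a handful of rows. It is an illustration under assumptions, not the answer's script: SQLite (through Python's standard sqlite3 module) stands in for SQL Server, dates are stored as ISO-8601 text so that BETWEEN compares them correctly, and the five dates and three memberships are made-up sample rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE DIM_DATE (DateKey TEXT NOT NULL PRIMARY KEY);
    CREATE TABLE FACT_MEMBER (
        MembershipId     INTEGER NOT NULL,
        ValidFromDateKey TEXT NOT NULL,
        ValidToDateKey   TEXT NOT NULL
    );
    -- Hypothetical sample rows, chosen so that one date has no members
    INSERT INTO DIM_DATE VALUES
        ('2010-01-01'), ('2010-01-02'), ('2010-01-03'),
        ('2010-01-04'), ('2010-01-05');
    INSERT INTO FACT_MEMBER VALUES
        (1, '2009-12-15', '2010-01-02'),
        (2, '2010-01-02', '2010-01-03'),
        (3, '2010-01-05', '2010-02-01');
""")

query = """
    WITH CTE_MEMBER AS (
        -- Dates that have at least one open membership
        SELECT D.DateKey, COUNT(*) AS MembershipCount
        FROM DIM_DATE AS D
        INNER JOIN FACT_MEMBER AS M
            ON D.DateKey BETWEEN M.ValidFromDateKey AND M.ValidToDateKey
        WHERE D.DateKey BETWEEN :range_start AND :range_end
        GROUP BY D.DateKey
        UNION ALL
        -- The "zero" set: every date in the range with a count of 0
        SELECT D.DateKey, 0 AS MembershipCount
        FROM DIM_DATE AS D
        WHERE D.DateKey BETWEEN :range_start AND :range_end
    )
    SELECT M.DateKey, SUM(M.MembershipCount) AS MembershipCount
    FROM CTE_MEMBER AS M
    GROUP BY M.DateKey
    ORDER BY M.DateKey
"""

rows = con.execute(
    query, {"range_start": "2010-01-01", "range_end": "2010-01-05"}
).fetchall()
for row in rows:
    print(row)
```

With these rows, 2010-01-04 has no open membership, yet it still appears in the output with a count of 0, which is what the zero set is for.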
Answer 3 (score: 1)
Here is how I would solve this problem with an equijoin:
--data generation
declare @Membership table (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insert into @Membership values
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')
declare @startDate date, @endDate date
select @startDate = MIN(ValidFromDate), @endDate = max(ValidToDate) from @Membership
--in order to use equijoin I need all days between min date and max date from Membership table (both columns)
;with cte as (
select @startDate [date]
union all
select DATEADD(day, 1, [date]) from cte
where [date] < @endDate
)
--in this query, we will assign value to each day:
--one, if project started on that day
--minus one, if project ended on that day
--then, it's enough to (cumulative) sum all this values to get how many projects were ongoing on particular day
select [date],
sum(case when [DATE] = ValidFromDate then 1 else 0 end +
case when [DATE] = ValidToDate then -1 else 0 end)
over (order by [date] rows between unbounded preceding and current row)
from cte [c]
left join @Membership [m]
on [c].[date] = [m].ValidFromDate or [c].[date] = [m].ValidToDate
option (maxrecursion 0)
Here is another solution:
--data generation
declare @Membership table (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insert into @Membership values
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')
;with cte as (
select CAST('2016-01-01' as date) [date]
union all
select DATEADD(day, 1, [date]) from cte
where [date] < '2016-12-31'
)
select [date],
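--<= so that a membership is counted on its first day, matching the inclusive BETWEEN in the question
(select COUNT(*) from @Membership where ValidFromDate <= [date]) -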
(select COUNT(*) from @Membership where ValidToDate < [date]) [ongoing]
from cte
option (maxrecursion 0)
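Answer 4 (score: 1)
Note that I think @PittsburghDBA is right in saying that the current query returns wrong results. The last day of a membership is not counted, so the final sum is lower than it should be. I have corrected that in this version.
This should improve a little on what you have so far: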
declare @s date = '20160101';
declare @e date = getdate();
with
x as (
select d, sum(c) c
from (
select ValidFromDateKey d, count(MembershipID) c
from Memberships
group by ValidFromDateKey
union all
-- dateadd needed to count last day of membership too!!
select dateadd(dd, 1, ValidToDateKey) d, -count(MembershipID) c
from Memberships
group by ValidToDateKey
)x
group by d
),
c as
(
select d, sum(x.c) over (order by d) as c
from x
)
select d.day, c cnt
from calendar d
left join c on d.day = c.d
where d.day between @s and @e
order by d.day;
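The effect of the dateadd correction is easiest to see on a single membership. The following Python sketch (the helper running_counts and the three-day membership are hypothetical, chosen only for illustration) compares decrementing on the ValidToDate itself with decrementing on the day after.

```python
from datetime import date, timedelta


def running_counts(valid_from, valid_to, days, end_offset_days):
    """Running sum of +1 on valid_from and -1 on valid_to + end_offset_days."""
    closes_on = valid_to + timedelta(days=end_offset_days)
    total, out = 0, []
    for day in days:
        if day == valid_from:
            total += 1
        if day == closes_on:
            total -= 1
        out.append(total)
    return out


days = [date(2016, 1, d) for d in range(1, 5)]  # 1 Jan to 4 Jan 2016
valid_from, valid_to = date(2016, 1, 1), date(2016, 1, 3)

same_day = running_counts(valid_from, valid_to, days, 0)
day_after = running_counts(valid_from, valid_to, days, 1)
print(same_day)   # [1, 1, 0, 0]  last day (3 Jan) is lost
print(day_after)  # [1, 1, 1, 0]  last day is counted
```

Only the second variant agrees with an inclusive BETWEEN ValidFromDate AND ValidToDate.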
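Answer 5 (score: -1)
First of all, your query yields "1" as MembershipCount even if no active membership exists for the given date.
You should return SUM(CASE WHEN m.MembershipID IS NOT NULL THEN 1 ELSE 0 END) AS MembershipCount.
For optimal performance, create an index on Memberships(ValidFromDateKey, ValidToDateKey, MembershipId) and another on DIM.[Date](CalendarYear, DateKey).
With that done, the optimal query should be: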
DECLARE @CalendarYear INT = 2000
SELECT dim.DateKey, SUM(CASE WHEN con.MembershipID IS NOT NULL THEN 1 ELSE 0 END) AS MembershipCount
FROM
DIM.[Date] dim
LEFT OUTER JOIN (
SELECT ValidFromDateKey, ValidToDateKey, MembershipID
FROM Memberships
WHERE
ValidFromDateKey <= CONVERT(DATETIME, CONVERT(VARCHAR, @CalendarYear) + '1231')
AND ValidToDateKey >= CONVERT(DATETIME, CONVERT(VARCHAR, @CalendarYear) + '0101')
) con
ON dim.DateKey BETWEEN con.ValidFromDateKey AND con.ValidToDateKey
WHERE dim.CalendarYear = @CalendarYear
GROUP BY dim.DateKey
ORDER BY dim.DateKey
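Now, for your last question: what would the equijoin equivalent of this query be?
There is NO WAY you can rewrite this as an equijoin!
Equijoin does not imply using join syntax. Equijoin implies using an equals predicate, whatever the syntax.
Your query requires a range comparison, so equals does not apply: a between or similar is required.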