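Aggregate for each day over a time series, without using non-equijoin logic

Asked: 2018-03-27 09:07:43

Tags: sql sql-server tsql date join

Initial Question

Given the following dataset paired with a dates table: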

MembershipId | ValidFromDate | ValidToDate
==========================================
0001         | 1997-01-01    | 2006-05-09
0002         | 1997-01-01    | 2017-05-12
0003         | 2005-06-02    | 2009-02-07

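How many Memberships are open on any given day, or across a time series of days?

Initial Answer

After this question was asked here, this answer provided the necessary functionality: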

select d.[Date]
      ,count(m.MembershipID) as MembershipCount
from DIM.[Date] as d
    left join Memberships as m
        on(d.[Date] between m.ValidFromDateKey and m.ValidToDateKey)
where d.CalendarYear = 2016
group by d.[Date]
order by d.[Date];
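
As a reference point for checking any rewrite, the count this query produces can be reproduced outside the database with a brute-force loop. The Python sketch below is only an illustration: it hard-codes the three sample memberships from the question, and `open_memberships` is a hypothetical helper name, not something from the original post.

```python
from datetime import date

# Sample memberships from the question: (MembershipId, ValidFromDate, ValidToDate)
MEMBERSHIPS = [
    ("0001", date(1997, 1, 1), date(2006, 5, 9)),
    ("0002", date(1997, 1, 1), date(2017, 5, 12)),
    ("0003", date(2005, 6, 2), date(2009, 2, 7)),
]


def open_memberships(day, memberships=MEMBERSHIPS):
    """Count memberships whose inclusive [from, to] range contains `day`.

    Mirrors the BETWEEN predicate of the query: both end dates count as open.
    """
    return sum(1 for _, valid_from, valid_to in memberships
               if valid_from <= day <= valid_to)


if __name__ == "__main__":
    for day in (date(2006, 5, 9), date(2006, 5, 10), date(2016, 1, 1)):
        print(day, open_memberships(day))
```

Because the range is inclusive at both ends, membership 0001 is still counted on 2006-05-09 and drops out on 2006-05-10.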

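A commenter remarked, however, that there are other approaches for when the non-equijoin takes too long.

Follow-up

As such, what would equijoin-only logic look like to replicate the output of the query above?

Progress So Far

From the answers provided so far I have come up with the following, which has performed best so far on the hardware I am using, against 3.2 million Membership records: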

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
)
,e as
(
    select d.[Date] as d
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
),c as
(
    select isnull(s.d,e.d) as d
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
)
select d.[Date]
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
;

Following on from that, to split this aggregate into its constituent groups per day I have the following, which also performs well:

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,s.MembershipGrouping as g
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
            ,s.MembershipGrouping
)
,e as
(
    select d.[Date] as d
        ,e.MembershipGrouping as g
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
            ,e.MembershipGrouping
),c as
(
    select isnull(s.d,e.d) as d
            ,isnull(s.g,e.g) as g
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (partition by isnull(s.g,e.g) order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
                and s.g = e.g
)
select d.[Date]
    ,c.g
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
        ,c.g
;

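Can anyone improve on the above?

6 Answers:

Answer 0 (score: 13)

If most of your memberships have validity ranges longer than a few days, have a look at Martin Smith's answer. That approach is likely to be faster.

When you take the calendar table (DIM.[Date]) and left join it to Memberships, you may end up scanning the Memberships table for each date of the range. Even if there is an index on (ValidFromDate, ValidToDate), it may not be very useful.

It is easy to turn this around. Scan the Memberships table only once and, for each membership, find the dates on which it is valid using CROSS APPLY.

Sample data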

DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);

INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');

DECLARE @RangeFrom date = '2006-01-01';
DECLARE @RangeTo   date = '2006-12-31';

Query 1

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= Memberships.ValidFromDate
            AND dbo.Calendar.dt <= Memberships.ValidToDate
            AND dbo.Calendar.dt >= @RangeFrom
            AND dbo.Calendar.dt <= @RangeTo
    ) AS CA
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE);
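
OPTION(RECOMPILE) is not really needed. I include it in all queries when I compare execution plans, to be sure that I am getting the latest plan while I experiment with the queries.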

When I looked at the plan of this query, I saw that the seek in the Calendar.dt table was using only ValidFromDate and ValidToDate; @RangeFrom and @RangeTo were pushed to the residual predicate. This is not ideal. The optimiser is not smart enough to calculate the maximum of two dates (ValidFromDate and @RangeFrom) and use that date as the starting point of the seek.

seek 1

It is easy to help the optimiser:

Query 2

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >=
                CASE WHEN Memberships.ValidFromDate > @RangeFrom
                THEN Memberships.ValidFromDate
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <=
                CASE WHEN Memberships.ValidToDate < @RangeTo
                THEN Memberships.ValidToDate
                ELSE @RangeTo END
    ) AS CA
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE)
;

In this query the seek is optimal and does not read dates that may be discarded later.

seek 2

Finally, you may not need to scan the whole Memberships table. We need only those rows where the given range of dates intersects the valid range of the membership.

Query 3

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >=
                CASE WHEN Memberships.ValidFromDate > @RangeFrom
                THEN Memberships.ValidFromDate
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <=
                CASE WHEN Memberships.ValidToDate < @RangeTo
                THEN Memberships.ValidToDate
                ELSE @RangeTo END
    ) AS CA
WHERE
    Memberships.ValidToDate >= @RangeFrom
    AND Memberships.ValidFromDate <= @RangeTo
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE)
;

Two intervals [a1;a2] and [b1;b2] intersect when

a2 >= b1 and a1 <= b2

These queries assume that the Calendar table has an index on dt.
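
The logic of the last query (skip memberships that do not intersect the range, then visit only the clamped span of days for each remaining membership) can be sketched outside SQL. This is a rough Python illustration under the assumption that both end dates are inclusive; `daily_counts` is a made-up name, and the loop over days stands in for the seek on the calendar table.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(memberships, range_from, range_to):
    """Tally open memberships per day, visiting each membership once.

    memberships: iterable of (valid_from, valid_to) date pairs, both inclusive.
    """
    counts = Counter()
    for valid_from, valid_to in memberships:
        # Two intervals [a1;a2] and [b1;b2] intersect when a2 >= b1 and a1 <= b2.
        if not (valid_to >= range_from and valid_from <= range_to):
            continue
        # Clamp to the requested range, like the CASE expressions in the query.
        first = max(valid_from, range_from)
        last = min(valid_to, range_to)
        day = first
        while day <= last:
            counts[day] += 1
            day += timedelta(days=1)
    return counts


if __name__ == "__main__":
    sample = [
        (date(1997, 1, 1), date(2006, 5, 9)),
        (date(1997, 1, 1), date(2017, 5, 12)),
        (date(2005, 6, 2), date(2009, 2, 7)),
    ]
    result = daily_counts(sample, date(2006, 1, 1), date(2006, 12, 31))
    print(result[date(2006, 5, 9)], result[date(2006, 5, 10)])
```

As with the CROSS APPLY queries, a day on which no membership is open produces no row at all rather than a zero.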

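You should experiment to see which indexes work better for the Memberships table. For the last query, if the table is rather large, two separate indexes on ValidFromDate and on ValidToDate will most likely be better than a single index on (ValidFromDate, ValidToDate).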

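You should try different queries and measure their performance on real hardware with real data. Performance may depend on the data distribution: how many memberships there are, what their valid dates are, how wide or narrow the given range is, and so on.

I recommend a great tool called SQL Sentry Plan Explorer for analysing and comparing execution plans. It is free. It shows a lot of useful statistics, such as the execution time and number of reads for each query. The screenshots above are from this tool.

Answer 1 (score: 6)

On the assumption that your date dimension contains every date covered by any membership period, you can use something like the following.

The join is an equi join, so it can use a hash join or a merge join, not just nested loops (which would execute the inner subtree once for each outer row).

Assuming an index on (ValidToDate) INCLUDE (ValidFromDate), or the reverse, this can use a single seek against Memberships and a single scan of the date dimension. For me, the query below takes less than a second to return the results for a year against a table with 3.2 million members and a general active membership of 1.4 million (script).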

DECLARE @StartDate DATE = '2016-01-01',
        @EndDate   DATE = '2016-12-31';

WITH MD
     AS (SELECT Date,
                SUM(Adj) AS MemberDelta
         FROM   Memberships
                CROSS APPLY (VALUES ( ValidFromDate, +1),
                                    --Membership count decremented day after the ValidToDate
                                    (DATEADD(DAY, 1, ValidToDate), -1) ) V(Date, Adj)
         WHERE
          --Members already expired before the time range of interest can be ignored
          ValidToDate >= @StartDate
          AND
          --Members whose membership starts after the time range of interest can be ignored
          ValidFromDate <= @EndDate
         GROUP  BY Date),
     MC
     AS (SELECT DD.DateKey,
                SUM(MemberDelta) OVER (ORDER BY DD.DateKey ROWS UNBOUNDED PRECEDING) AS CountOfNonIgnoredMembers
         FROM   DIM_DATE DD
                LEFT JOIN MD
                  ON MD.Date = DD.DateKey)
SELECT DateKey,
       CountOfNonIgnoredMembers AS MembershipCount
FROM   MC
WHERE  DateKey BETWEEN @StartDate AND @EndDate 
ORDER BY DateKey

Demo (uses an extended period, as the calendar year 2016 isn't very interesting with the example data)
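
The idea behind this query can be sketched outside SQL: a +1 on each start date, a -1 on the day after each end date, and a running total over the calendar. The Python version below is a minimal sketch, not a translation of the plan. `counts_by_running_total` is a hypothetical name, and instead of a date dimension it simply walks every day from the earliest recorded delta, which is an assumption made for the illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import accumulate


def counts_by_running_total(memberships, start, end):
    """Return {day: open membership count} for every day in [start, end]."""
    delta = defaultdict(int)
    for valid_from, valid_to in memberships:
        # Skip memberships that cannot overlap the range of interest.
        if valid_to < start or valid_from > end:
            continue
        delta[valid_from] += 1
        # Decrement the day after ValidToDate so the last day still counts.
        delta[valid_to + timedelta(days=1)] -= 1

    # Walk the calendar from the earliest delta so that memberships opened
    # before `start` are already included in the running total.
    first = min(min(delta, default=start), start)
    days = [first + timedelta(days=i) for i in range((end - first).days + 1)]
    totals = accumulate(delta.get(d, 0) for d in days)
    return {d: t for d, t in zip(days, totals) if d >= start}


if __name__ == "__main__":
    sample = [
        (date(1997, 1, 1), date(2006, 5, 9)),
        (date(1997, 1, 1), date(2017, 5, 12)),
        (date(2005, 6, 2), date(2009, 2, 7)),
    ]
    result = counts_by_running_total(sample, date(2006, 1, 1), date(2006, 12, 31))
    print(result[date(2006, 5, 9)], result[date(2006, 5, 10)])
```

Each membership is touched once, and the per-day work is a single addition, which is why this shape tends to suit long-lived memberships.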

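Answer 2 (score: 2)

One approach is to first use an INNER JOIN to find the set of matches, with COUNT() projecting a MemberCount GROUPed BY DateKey, then UNION ALL that with the same set of dates carrying a 0 as the member count for each date. The last step is to SUM() the MemberCount of this union and GROUP BY DateKey. As requested, this avoids LEFT JOIN and NOT EXISTS. As another member pointed out, this is not an equi-join, because we need to use a range, but I think it does what you intend.

This serves up 1 year's worth of data with around 100k logical reads. On an ordinary laptop with a spinning disk, from a cold cache, it serves 1 month in under a second (with correct counts).

Here is an example that creates 3.3 million rows of random duration. The query at the bottom returns one month's worth of data.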

--Stay quiet for a moment
SET NOCOUNT ON
SET STATISTICS IO OFF
SET STATISTICS TIME OFF

--Clean up if re-running
DROP TABLE IF EXISTS DIM_DATE
DROP TABLE IF EXISTS FACT_MEMBER

--Date dimension
CREATE TABLE DIM_DATE
  (
  DateKey DATE NOT NULL 
  )

--Membership fact
CREATE TABLE FACT_MEMBER
  (
  MembershipId INT NOT NULL
  , ValidFromDateKey DATE NOT NULL
  , ValidToDateKey DATE NOT NULL
  )

--Populate Date dimension from 2001 through end of 2018
DECLARE @startDate DATE = '2001-01-01'
DECLARE @endDate DATE = '2018-12-31'
;WITH CTE_DATE AS
(
SELECT @startDate AS DateKey
UNION ALL
SELECT
       DATEADD(DAY, 1, DateKey)
FROM
       CTE_DATE AS D
WHERE
       D.DateKey < @endDate
)
INSERT INTO
  DIM_DATE
  (
  DateKey
  )
SELECT
  D.DateKey
FROM
  CTE_DATE AS D
OPTION (MAXRECURSION 32767)

--Populate Membership fact with members having a random membership length from 1 to 36 months 
;WITH CTE_DATE AS
(
SELECT @startDate AS DateKey
UNION ALL
SELECT
       DATEADD(DAY, 1, DateKey)
FROM
       CTE_DATE AS D
WHERE
       D.DateKey < @endDate
)
,CTE_MEMBER AS
(
SELECT 1 AS MembershipId
UNION ALL
SELECT MembershipId + 1 FROM CTE_MEMBER WHERE MembershipId < 500
)
,
CTE_MEMBERSHIP
AS
(
SELECT
  ROW_NUMBER() OVER (ORDER BY NEWID()) AS MembershipId
  , D.DateKey AS ValidFromDateKey
FROM
  CTE_DATE AS D
  CROSS JOIN CTE_MEMBER AS M
)
INSERT INTO
    FACT_MEMBER
    (
    MembershipId
    , ValidFromDateKey
    , ValidToDateKey
    )
SELECT
    M.MembershipId
    , M.ValidFromDateKey
      , DATEADD(MONTH, FLOOR(RAND(CHECKSUM(NEWID())) * (36-1)+1), M.ValidFromDateKey) AS ValidToDateKey
FROM
    CTE_MEMBERSHIP AS M
OPTION (MAXRECURSION 32767)

--Add clustered Primary Key to Date dimension
ALTER TABLE DIM_DATE ADD CONSTRAINT PK_DATE PRIMARY KEY CLUSTERED
    (
    DateKey ASC
    )

--Index
--(Optimize in your spare time)
DROP INDEX IF EXISTS SK_FACT_MEMBER ON FACT_MEMBER
CREATE CLUSTERED INDEX SK_FACT_MEMBER ON FACT_MEMBER
    (
    ValidFromDateKey ASC
    , ValidToDateKey ASC
    , MembershipId ASC
    )


RETURN

--Start test
--Emit stats
SET STATISTICS IO ON
SET STATISTICS TIME ON

--Establish range of dates
DECLARE
  @rangeStartDate DATE = '2010-01-01'
  , @rangeEndDate DATE = '2010-01-31'

--UNION the count of members for a specific date range with the "zero" set for the same range, and SUM() the counts
;WITH CTE_MEMBER
AS
(
SELECT
    D.DateKey
    , COUNT(*) AS MembershipCount
FROM
    DIM_DATE AS D
    INNER JOIN FACT_MEMBER AS M ON
        M.ValidFromDateKey <= @rangeEndDate
        AND M.ValidToDateKey >= @rangeStartDate
        AND D.DateKey BETWEEN M.ValidFromDateKey AND M.ValidToDateKey
WHERE
    D.DateKey BETWEEN @rangeStartDate AND @rangeEndDate
GROUP BY
    D.DateKey

UNION ALL

SELECT
    D.DateKey
    , 0 AS MembershipCount
FROM
    DIM_DATE AS D
WHERE
    D.DateKey BETWEEN @rangeStartDate AND @rangeEndDate
)
SELECT
    M.DateKey
    , SUM(M.MembershipCount) AS MembershipCount
FROM
    CTE_MEMBER AS M
GROUP BY
    M.DateKey
ORDER BY
    M.DateKey ASC
OPTION (RECOMPILE, MAXDOP 1)
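
The shape of the final query (matched rows, plus a zero row for every date, summed per date) can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than a port: `counts_with_zero_rows` is a hypothetical name, the date range is generated in code instead of read from DIM_DATE, and the first branch emits one row per match where the SQL pre-aggregates with COUNT(*).

```python
from collections import defaultdict
from datetime import date, timedelta


def counts_with_zero_rows(memberships, range_start, range_end):
    """Emulate 'matched counts UNION ALL zero rows, then SUM ... GROUP BY'."""
    days = [range_start + timedelta(days=i)
            for i in range((range_end - range_start).days + 1)]

    # First branch of the union: one (day, 1) row per matching membership.
    # (The SQL aggregates this branch with COUNT(*); the final sum is the same.)
    matched = [(day, 1)
               for day in days
               for valid_from, valid_to in memberships
               if valid_from <= day <= valid_to]
    # Second branch: a (day, 0) row for every day in the range.
    zeros = [(day, 0) for day in days]

    # Outer query: SUM the counts grouped by day.
    totals = defaultdict(int)
    for day, n in matched + zeros:
        totals[day] += n
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    one = [(date(2010, 1, 10), date(2010, 1, 20))]
    result = counts_with_zero_rows(one, date(2010, 1, 1), date(2010, 1, 31))
    print(result[date(2010, 1, 5)], result[date(2010, 1, 10)])
```

The zero rows guarantee that every date in the range appears in the output, which is the job a LEFT JOIN would otherwise do.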

Answer 3 (score: 1)

Here is how I would solve this problem with an equijoin:

--data generation
declare @Membership table (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insert into @Membership values
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')

declare @startDate date, @endDate date
select @startDate =  MIN(ValidFromDate), @endDate = max(ValidToDate) from @Membership
--in order to use equijoin I need all days between min date and max date from Membership table (both columns)
;with cte as (
    select @startDate [date]
    union all
    select DATEADD(day, 1, [date]) from cte
    where [date] < @endDate
)
--in this query, we will assign value to each day:
--one, if project started on that day
--minus one, if project ended on that day
--then, it's enough to (cumulative) sum all this values to get how many projects were ongoing on particular day
select [date],
       sum(case when [DATE] = ValidFromDate then 1 else 0 end +
            case when [DATE] = ValidToDate then -1 else 0 end)
            over (order by [date] rows between unbounded preceding and current row)
from cte [c]
left join @Membership [m]
on [c].[date] = [m].ValidFromDate  or [c].[date] = [m].ValidToDate
option (maxrecursion 0)

Here is another solution:

--data generation
declare @Membership table (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insert into @Membership values
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')

;with cte as (
    select CAST('2016-01-01' as date) [date]
    union all
    select DATEADD(day, 1, [date]) from cte
    where [date] < '2016-12-31'
)

select [date],
       (select COUNT(*) from @Membership where ValidFromDate <= [date]) -
       (select COUNT(*) from @Membership where ValidToDate < [date]) [ongoing]
from cte
option (maxrecursion 0)
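
The second query counts the memberships that have started by a given date and subtracts those that have already ended. With both date columns sorted, each of those counts becomes a binary search. The sketch below shows that idea in Python, assuming a membership counts as open on both its first and last day; `make_counter` and `open_on` are names invented for the example.

```python
from bisect import bisect_left, bisect_right
from datetime import date


def make_counter(memberships):
    """Build a function day -> number of memberships open on that day."""
    starts = sorted(valid_from for valid_from, _ in memberships)
    ends = sorted(valid_to for _, valid_to in memberships)

    def open_on(day):
        started = bisect_right(starts, day)   # rows with ValidFromDate <= day
        ended = bisect_left(ends, day)        # rows with ValidToDate < day
        return started - ended

    return open_on


if __name__ == "__main__":
    sample = [
        (date(1997, 1, 1), date(2006, 5, 9)),
        (date(1997, 1, 1), date(2017, 5, 12)),
        (date(2005, 6, 2), date(2009, 2, 7)),
    ]
    open_on = make_counter(sample)
    print(open_on(date(2006, 5, 9)), open_on(date(2006, 5, 10)))
```

In SQL terms the two sorted lists correspond to separate indexes on ValidFromDate and ValidToDate.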

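Answer 4 (score: 1)

Note that I think @PittsburghDBA is right in saying that the current query returns wrong results. The last day of a membership is not counted, so the final count is lower than it should be. I have corrected that in this version.

This should improve a little on your progress so far: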

declare @s date = '20160101';
declare @e date = getdate();

with 
x as (
    select d, sum(c) c
    from (
        select ValidFromDateKey d, count(MembershipID) c
        from Memberships
        group by ValidFromDateKey 

        union all

        -- dateadd needed to count last day of membership too!!
        select dateadd(dd, 1, ValidToDateKey) d, -count(MembershipID) c
        from Memberships
        group by ValidToDateKey 
    )x
    group by d
),
c as
(
    select d, sum(x.c) over (order by d) as c
    from x
)
select d.day, c cnt
from calendar d
left join c on d.day = c.d
where d.day between @s and @e
order by d.day;

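Answer 5 (score: -1)

First of all, your query yields "1" as the MembershipCount even when no active membership exists for the given date.

You should return SUM(CASE WHEN m.MembershipID IS NOT NULL THEN 1 ELSE 0 END) AS MembershipCount

For optimal performance, create an index on Memberships(ValidFromDateKey, ValidToDateKey, MembershipId) and another on DIM.[Date](CalendarYear, DateKey).

With that done, the optimal query should be: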

DECLARE @CalendarYear INT = 2000

SELECT dim.DateKey, SUM(CASE WHEN con.MembershipID IS NOT NULL THEN 1 ELSE 0 END) AS MembershipCount
FROM
    DIM.[Date] dim
        LEFT OUTER JOIN (
            SELECT ValidFromDateKey, ValidToDateKey, MembershipID
            FROM Memberships
            WHERE
                    ValidFromDateKey <= CONVERT(DATETIME, CONVERT(VARCHAR, @CalendarYear) + '1231')
                AND ValidToDateKey   >= CONVERT(DATETIME, CONVERT(VARCHAR, @CalendarYear) + '0101')
        ) con
        ON dim.DateKey BETWEEN con.ValidFromDateKey AND con.ValidToDateKey
WHERE dim.CalendarYear = @CalendarYear
GROUP BY dim.DateKey
ORDER BY dim.DateKey

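Now, for your last question, about the equivalent equijoin query:

There is NO WAY you can rewrite this as an equijoin!

An equijoin does not imply using join syntax. An equijoin implies using an equals predicate, whatever the syntax.

Your query requires a range comparison, so an equals predicate is not applicable: a BETWEEN or something similar is needed.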