SQL Data Cleanup - Overlapping Date Ranges

Time: 2019-04-09 00:04:42

Tags: sql sql-server

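I'm hoping someone can help me with the best way to approach this problem.

Our organization currently uses a sales cycle to judge how our retailers are performing, based on each retailer's first ship date. These are the business rules: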

Nurture Stage - 1st year
Graduate Stage - 2nd year
Ongoing Stage - 3rd year and on
Inactive Stage - stop doing business
Restart Stage - do business with us after an Inactive Stage
Change Owner Stage - sell their business and new owner does business with us

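To make things more complicated, no retailer can be on more than one type of program at any given time. So if they buy finished products from us, they cannot also be on the program where they buy the materials they need themselves.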

StageNo     ProgramNo   CustomerNo  ProgramType         StageDescription  StartDate   EndDate
CAPS041835  CAP010611   RL023238    Packaged            Nurture           2019-04-04  2019-04-04
CAPS041836  CAP010611   RL023238    Packaged            Inactive          2019-04-05  2999-01-01
CAPS041837  CAP010612   RL023238    Pre-Made in Bulk    Nurture           2019-04-04  2999-01-01

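Above is an example of the data anomaly. The date 2999-01-01 simply represents a blank date in our ERP.

On 2019-04-04, a user created the Packaged program and then decided the retailer should have been set up as Pre-Made in Bulk instead of Packaged.

The ERP ends the current stage on the last invoice date or, if there is none, on today's date, and starts the Inactive stage from today + 1.

So if I run an analysis, all shipments on 2019-04-04 get applied to both the Packaged program and the Pre-Made program.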
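To make the double counting concrete, here is a minimal sketch that loads the three sample rows into a table variable and lists every pair of stages from different programs whose date ranges overlap for the same customer. The table variable `@Stage` and the output column names are assumptions for illustration; they use the short column names from the sample data, not the real table behind the query further down.

```sql
-- Hypothetical table variable holding the three sample rows from the post.
DECLARE @Stage TABLE
(
    StageNo          VARCHAR(20),
    ProgramNo        VARCHAR(20),
    CustomerNo       VARCHAR(20),
    ProgramType      VARCHAR(30),
    StageDescription VARCHAR(30),
    StartDate        DATE,
    EndDate          DATE
);

INSERT INTO @Stage VALUES
 ('CAPS041835','CAP010611','RL023238','Packaged',        'Nurture', '2019-04-04','2019-04-04')
,('CAPS041836','CAP010611','RL023238','Packaged',        'Inactive','2019-04-05','2999-01-01')
,('CAPS041837','CAP010612','RL023238','Pre-Made in Bulk','Nurture', '2019-04-04','2999-01-01');

-- Two closed date ranges overlap when each one starts on or before the other ends.
SELECT
     a.CustomerNo
    ,a.StageNo AS StageNoA
    ,b.StageNo AS StageNoB
    ,CASE WHEN a.StartDate > b.StartDate THEN a.StartDate ELSE b.StartDate END AS OverlapStart
    ,CASE WHEN a.EndDate   < b.EndDate   THEN a.EndDate   ELSE b.EndDate   END AS OverlapEnd
FROM
    @Stage AS a
JOIN
    @Stage AS b
        ON  a.CustomerNo = b.CustomerNo
        AND a.ProgramNo  < b.ProgramNo    -- different programs, each pair reported once
        AND a.StartDate <= b.EndDate
        AND b.StartDate <= a.EndDate
ORDER BY
    a.StageNo, b.StageNo;
```

On the sample rows this reports CAPS041835 against CAPS041837 (the single day 2019-04-04) and also CAPS041836 against CAPS041837, because the open-ended Inactive stage runs alongside the Pre-Made in Bulk stage. Whether an Inactive stage should count as an overlap is a business decision that this sketch does not make.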

Ideally I would like to get rid of the Packaged program entirely, but if that isn't possible, this is what I would like to clean it up to:

StageNo     ProgramNo   CustomerNo  ProgramType         StageDescription  StartDate   EndDate
CAPS041835  CAP010611   RL023238    Packaged            Nurture           2019-04-04  2019-04-04
CAPS041836  CAP010611   RL023238    Packaged            Inactive          2019-04-04  2019-04-04
CAPS041837  CAP010612   RL023238    Pre-Made in Bulk    Nurture           2019-04-04  2999-01-01

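That way I can go in and fix it. Even if I left it like that, it wouldn't be the end of the world, because I could cast the ship date to DATETIME and add 1 second, which would mean a sale only ever falls under one program.

I started by writing a query to find the gaps between date ranges, so that I could look for any gap whose date difference is less than 0.

Here is what I have so far...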

WITH CustomerProgram AS 
(
    SELECT
         ROW_NUMBER() OVER (ORDER BY [CustomerNo] ASC, [ProgramGroupId] ASC, [StageStartDate] ASC, [StageEndDate] ASC, [StagePrecedence] ASC, [CustomerProgramStageNo] ASC) AS [RowId]
        ,*
        ,COUNT([CustomerProgramStageNo]) OVER (PARTITION BY [ProgramGroupId]) AS [StageCount]
    FROM
    (
        SELECT
             --RANK() OVER (ORDER BY [CustomerNo] ASC, [ProgramDescription] ASC) AS [ProgramGroupId]
             RANK() OVER (ORDER BY [CustomerNo] ASC, [CustomerProgramNo] ASC) AS [ProgramGroupId]
            ,[CustomerProgramNo]
            ,[CustomerProgramStageNo]
            ,[CustomerNo]
            ,[ProgramCode]
            ,[ProgramStageCode]
            ,[ProgramStageDescription]
            ,CASE [ProgramStageDescription]
                WHEN 'Nurture'          THEN 1
                WHEN 'Graduate'         THEN 2
                WHEN 'Change Ownership' THEN 3
                WHEN 'Restart'          THEN 3
                WHEN 'Ongoing'          THEN 4
                WHEN 'Inactive'         THEN 5
                ELSE NULL
            END                                 AS [StagePrecedence]
            ,CAST([StageStartDate] AS DATETIME) AS [StageStartDate]
            ,CAST([StageEndDate] AS DATETIME)   AS [StageEndDate]
        FROM
            [CustomerProgramAndStage]
    )   CustomerProgram
)

,StagesAndGaps AS 
(
    SELECT
         ROW_NUMBER() OVER (ORDER BY [CustomerNo] ASC, [ProgramGroupId] ASC, [StageStartDate] ASC, [StageEndDate] ASC) AS [RowId]
        ,[ProgramGroupId]
        ,[StageCount]
        ,[CustomerNo]
        ,[DateRangeType]
        ,[StageStartDate]
        ,[StageEndDate]
        ,DATEDIFF(DAY,[StageStartDate],[StageEndDate])  AS [StageDateDayDiff]
        ,DATEDIFF(YEAR,[StageStartDate],[StageEndDate]) AS [StageDateYearDiff]
        ,[StartDateRowId]
        ,[EndDateRowId]
        ,[PreviousProgramCode]
        ,[NextProgramCode]
        ,[PreviousStagePrecedence]
        ,[NextStagePrecedence]
        ,[PreviousStageNo]
        ,[NextStageNo]
    FROM
    (
        SELECT
             [ProgramGroupId]                           AS [ProgramGroupId]
            ,[StageCount]                               AS [StageCount]
            ,[CustomerNo]                               AS [CustomerNo]
            ,[DateRangeType]                            AS [DateRangeType]
            ,ISNULL([StageStartDate],'1800-01-01')      AS [StageStartDate]
            ,ISNULL([StageEndDate],'3999-01-01')        AS [StageEndDate]
            ,ISNULL([StartDateRowId],0)                 AS [StartDateRowId]
            ,ISNULL([EndDateRowId],9999999)             AS [EndDateRowId]
            ,ISNULL([PreviousProgramCode],'Start')      AS [PreviousProgramCode]
            ,ISNULL([NextProgramCode],'End')            AS [NextProgramCode]
            ,ISNULL([PreviousStagePrecedence],0)        AS [PreviousStagePrecedence]
            ,ISNULL([NextStagePrecedence],999)          AS [NextStagePrecedence]
            ,ISNULL([PreviousStageNo],'Start')          AS [PreviousStageNo]
            ,ISNULL([NextStageNo],'End')                AS [NextStageNo]
        FROM
        (
            SELECT --  Gaps include time period before the start of a Program
                 NextStage.[ProgramGroupId]                 AS [ProgramGroupId]
                ,NextStage.[StageCount]                     AS [StageCount]
                ,NextStage.[CustomerNo]                     AS [CustomerNo]
                ,'Gap'                                      AS [DateRangeType]
                ,PreviousStage.[StageEndDate]               AS [StageStartDate]
                ,NextStage.[StageStartDate]                 AS [StageEndDate]
                ,PreviousStage.[RowId]                      AS [StartDateRowId]
                ,NextStage.[RowId]                          AS [EndDateRowId]
                ,PreviousStage.[ProgramCode]                AS [PreviousProgramCode]
                ,NextStage.[ProgramCode]                    AS [NextProgramCode]
                ,PreviousStage.[StagePrecedence]            AS [PreviousStagePrecedence]
                ,NextStage.[StagePrecedence]                AS [NextStagePrecedence]
                ,PreviousStage.[CustomerProgramStageNo]     AS [PreviousStageNo]
                ,NextStage.[CustomerProgramStageNo]         AS [NextStageNo]
            FROM
            (
                SELECT
                     [RowId]
                    ,[ProgramGroupId]
                    ,[StageCount]
                    ,[CustomerProgramStageNo]
                    ,[CustomerNo]
                    ,[ProgramCode]
                    ,[StagePrecedence]
                    ,[StageStartDate]
                FROM
                    CustomerProgram
            )   NextStage    
            LEFT JOIN
            (
                SELECT
                     [RowId]
                    ,[ProgramGroupId]
                    ,[StageCount]
                    ,[CustomerProgramStageNo]
                    ,[CustomerNo]
                    ,[ProgramCode]
                    ,[StagePrecedence]
                    ,[StageEndDate]
                FROM
                    CustomerProgram
            )   PreviousStage
                    ON NextStage.[ProgramGroupId] = PreviousStage.[ProgramGroupId]
                    AND NextStage.[RowId] - 1 = PreviousStage.[RowId]

            UNION

            SELECT --  Gaps include time period after the end of a Program (year 2999 if Stage is active)
                 PreviousStage.[ProgramGroupId]             AS [ProgramGroupId]
                ,PreviousStage.[StageCount]                 AS [StageCount]
                ,PreviousStage.[CustomerNo]                 AS [CustomerNo]
                ,'Gap'                                      AS [DateRangeType]
                ,PreviousStage.[StageEndDate]               AS [StageStartDate]
                ,NextStage.[StageStartDate]                 AS [StageEndDate]
                ,PreviousStage.[RowId]                      AS [StartDateRowId]
                ,NextStage.[RowId]                          AS [EndDateRowId]
                ,PreviousStage.[ProgramCode]                AS [PreviousProgramCode]
                ,NextStage.[ProgramCode]                    AS [NextProgramCode]
                ,PreviousStage.[StagePrecedence]            AS [PreviousStagePrecedence]
                ,NextStage.[StagePrecedence]                AS [NextStagePrecedence]
                ,PreviousStage.[CustomerProgramStageNo]     AS [PreviousStageNo]
                ,NextStage.[CustomerProgramStageNo]         AS [NextStageNo]
            FROM
            (
                SELECT
                     [RowId]
                    ,[ProgramGroupId]
                    ,[StageCount]
                    ,[CustomerProgramStageNo]
                    ,[CustomerNo]
                    ,[ProgramCode]
                    ,[StagePrecedence]
                    ,[StageEndDate]
                FROM
                    CustomerProgram
            )   PreviousStage
            LEFT JOIN
            (
                SELECT
                     [RowId]
                    ,[ProgramGroupId]
                    ,[StageCount]
                    ,[CustomerProgramStageNo]
                    ,[CustomerNo]
                    ,[ProgramCode]
                    ,[StagePrecedence]
                    ,[StageStartDate]
                FROM
                    CustomerProgram
            )   NextStage
                    ON PreviousStage.[ProgramGroupId] = NextStage.[ProgramGroupId]
                    AND PreviousStage.[RowId] + 1 = NextStage.[RowId]

            UNION

            SELECT --  Stage data
                 [ProgramGroupId]           AS [ProgramGroupId]
                ,[StageCount]               AS [StageCount]
                ,[CustomerNo]               AS [CustomerNo]
                ,'Stage'                    AS [DateRangeType]
                ,[StageStartDate]           AS [StageStartDate]
                ,[StageEndDate]             AS [StageEndDate]
                ,[RowId]                    AS [StartDateRowId]
                ,[RowId]                    AS [EndDateRowId]
                ,[ProgramCode]              AS [PreviousProgramCode]
                ,[ProgramCode]              AS [NextProgramCode]
                ,[StagePrecedence]          AS [PreviousStagePrecedence]
                ,[StagePrecedence]          AS [NextStagePrecedence]
                ,[CustomerProgramStageNo]   AS [PreviousStageNo]
                ,[CustomerProgramStageNo]   AS [NextStageNo]
            FROM
                CustomerProgram
        )   StagesAndGaps
    )   StagesAndGaps
)


SELECT 
    *
FROM 
    StagesAndGaps
WHERE 
    [DateRangeType] = 'Gap'
    AND [StageStartDate] NOT IN ('1800-01-01','2999-01-01')
ORDER BY 
    [RowId] ASC

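I think I'm heading in the right direction, but I'm also not sure whether there is a simpler way. Sorry for the long post, but any help would be greatly appreciated!

1 Answer:

Answer 0: (score: 0)

You can use PARTITION BY and ORDER BY to split the data set into ordered chunks and then identify the records that need to be updated or deleted, which is what you have tried. However, you can be more precise.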

For example, you have used only ORDER BY, which gives you:

row_num  customer_no  stage  stage_startdate
1        1            A      2019-01-01
2        2            B      2019-12-30

Here you can't compare row_num 1 and 2, because they belong to two different customers.

So first use PARTITION BY to split the data into chunks, then use ORDER BY to order the rows within each chunk.
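As a minimal sketch of that idea, the query below restarts the row numbering for each customer. The table variable `@stages` and its rows are made up for illustration (the second row for customer 1 is not from the question), using the column names from the small example above.

```sql
-- Hypothetical sample rows; only the column names follow the example above.
DECLARE @stages TABLE (customer_no INT, stage CHAR(1), stage_startdate DATE);

INSERT INTO @stages VALUES
 (1, 'A', '2019-01-01')
,(1, 'B', '2019-06-01')
,(2, 'B', '2019-12-30');

-- PARTITION BY restarts the numbering per customer,
-- ORDER BY decides the order of the rows inside each customer.
SELECT
     ROW_NUMBER() OVER (PARTITION BY customer_no ORDER BY stage_startdate) AS row_num
    ,customer_no
    ,stage
    ,stage_startdate
FROM
    @stages
ORDER BY
    customer_no, row_num;
```

With this numbering, row_num 1 and 2 are only ever neighbours inside the same customer, so comparing a row with the one before it is meaningful.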

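And, instead of updating, you can flag the unwanted records and then delete them.

To do that, add a "to_be_deleted" column to flag the records that need to be deleted. If you are on SQL Server 2012+, you can populate this "to_be_deleted" column easily by using LEAD() or LAG() on top of the PARTITION BY output. The LEAD() and LAG() functions let you compare a row with the next or the previous row, so you can easily check for duplicates, flag them, and finally delete them.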
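Here is a rough sketch of that comparison on the sample rows from the question, assuming SQL Server 2012 or later. The table variable `@Stage` and the columns `PrevEndDate`, `GapDays` and `OverlapsPrevious` are names chosen for illustration. The flag only marks a stage that starts on or before the end of the previous stage for the same customer; deciding which row of an overlapping pair becomes "to_be_deleted" is a business rule that the sketch leaves open.

```sql
-- Hypothetical table variable holding the three sample rows from the question.
DECLARE @Stage TABLE
(
    StageNo    VARCHAR(20),
    ProgramNo  VARCHAR(20),
    CustomerNo VARCHAR(20),
    StartDate  DATE,
    EndDate    DATE
);

INSERT INTO @Stage VALUES
 ('CAPS041835','CAP010611','RL023238','2019-04-04','2019-04-04')
,('CAPS041836','CAP010611','RL023238','2019-04-05','2999-01-01')
,('CAPS041837','CAP010612','RL023238','2019-04-04','2999-01-01');

WITH Ordered AS
(
    SELECT
         StageNo
        ,ProgramNo
        ,CustomerNo
        ,StartDate
        ,EndDate
        -- End date of the previous stage of the same customer (NULL for the first row).
        ,LAG(EndDate) OVER (PARTITION BY CustomerNo
                            ORDER BY StartDate, EndDate, StageNo) AS PrevEndDate
    FROM
        @Stage
)
SELECT
     *
    ,DATEDIFF(DAY, PrevEndDate, StartDate)                    AS GapDays
    ,CASE WHEN PrevEndDate >= StartDate THEN 1 ELSE 0 END     AS OverlapsPrevious
FROM
    Ordered
ORDER BY
    CustomerNo, StartDate, EndDate, StageNo;
```

Note that LAG() only looks at the immediately preceding row, so a long-running earlier stage can hide behind a shorter one in between. If that can happen in the real data, a running maximum of the end date over all preceding rows would be a more robust comparison.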

For LEAD() and LAG(), you can refer to this: https://blog.sqlauthority.com/2011/11/15/sql-server-introduction-to-lead-and-lag-analytic-functions-introduced-in-sql-server-2012/

Hope this helps :). Nice initiative.