Question

I have a table like this:

 id  col2   modified date
 1  red     1/7/2019
 1  green   2/7/2019
 1  blue    3/7/2019
 2  green   1/12/2019
 2  blue    3/02/2019
 2  red     4/19/2019
 3  red     12/12/2018
 3  green   02/10/2019

I need to write a query that behaves as follows.

Suppose I run it in April 2019; the output should then look like this:

[expected-output table missing from the source]

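So basically I need to know the value of col2 for each ID on the first day of every month. For example, for ID '1' there is no col2 value in table 1 as of January 1, because the first modification was on January 7. So col4 is NULL in the second table. On February 1, however, it shows red, because that is the most recent value as of that date. The same logic applies to the other IDs. We need to look back and find the latest col2 value for each id as of the 1st of every month.

I have tried several approaches, but none of them handles all the cases at once.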

Answer 1

Rextester DEMO:

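Your question leaves a lot unknown: how do we know to limit this to only 4 months? A given month could contain several color changes; do you want each one listed? So I made some assumptions based on the expected results you defined. However, I believe your expected result for the 4th entry of ID 1 is wrong; it should be blue. If that assumption is wrong, I can't find any pattern in your expected results.

I think solving this requires knowing CROSS JOIN and OUTER APPLY. Knowing how to use a recursive CTE (CTE = common table expression) to get the dates in a range may also help, depending on your long-term needs; or, as suggested in the comments, simply keep a "Dates" table you can pull from.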

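In the query below:

CTE is your data table.
Dates is a table containing the first day of each of the 4 months mentioned above. This data set could be generated from your data; I provide links on how to do that at the end.
The CROSS JOIN ensures every ID gets all 4 months, in case there are gaps in the data.
The OUTER APPLY gets the most recent color change on or before the start of that month, if such a record exists. We use OUTER APPLY because such a record may not exist, as with entry 1.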

--CTE is your sample data
with cte (id,  col2,   modifieddate) as 
    (SELECT 1,   'red',  cast('20190107' as date)  UNION ALL
     SELECT 1,   'green',cast('20190207' as Date) UNION ALL
     SELECT 1,   'blue',cast('20190307' as Date) UNION ALL    
     SELECT 2,   'green',cast('20190112' as Date) UNION ALL   
     SELECT 2,   'blue',cast('20190302' as Date) UNION ALL    
     SELECT 2,   'red',cast('20190419' as Date) UNION ALL     
     SELECT 3,   'red',cast('20181212' as Date) UNION ALL     
     SELECT 3,   'green',cast('20190210' as Date)),
-- You didn't define how you know where to start /stop so I just based this on 
-- your results which only went for four months Jan-April of 2019.
  Dates as (SELECT cast('20190101' as date) FirstofMonth  UNION ALL
               SELECT cast('20190201' as date) FirstofMonth  UNION ALL
               SELECT cast('20190301' as date) FirstofMonth  UNION ALL
               SELECT cast('20190401' as date) FirstofMonth )
--These are really the steps needed.
--Cross join the dates to your unique ID list so we get one row per date per ID. This fills in the missing dates if any exist.
--Then use an OUTER APPLY to get the most recent color change on or before that first of month for that ID. The correlated subquery only sees that ID's changes up to the date in question, so TOP 1 with ORDER BY ModifiedDate DESC returns the latest one.
     SELECT Z.iD, A.FirstofMonth, X.Col2 as Col4
     FROM Dates A
     CROSS JOIN (SELECT DISTINCT ID FROM CTE) Z
     OUTER APPLY(SELECT TOP 1 * FROM CTE B
                 WHERE Z.ID = B.ID
                   and B.ModifiedDate<=A.FirstOfMonth
                 ORDER BY B.ModifiedDate desc) X
     --Explicit ordering so the rows come back as listed below
     ORDER BY Z.ID, A.FirstofMonth

Giving us:

+----+----+---------------------+-------+
|    | iD |    FirstofMonth     | Col4  |
+----+----+---------------------+-------+
|  1 |  1 | 01.01.2019 00:00:00 | NULL  |
|  2 |  1 | 01.02.2019 00:00:00 | red   |
|  3 |  1 | 01.03.2019 00:00:00 | green |
|  4 |  1 | 01.04.2019 00:00:00 | blue  | <-- I think you have an error in your expected results.
|  5 |  2 | 01.01.2019 00:00:00 | NULL  |
|  6 |  2 | 01.02.2019 00:00:00 | green |
|  7 |  2 | 01.03.2019 00:00:00 | green |
|  8 |  2 | 01.04.2019 00:00:00 | blue  |
|  9 |  3 | 01.01.2019 00:00:00 | red   |
| 10 |  3 | 01.02.2019 00:00:00 | red   |
| 11 |  3 | 01.03.2019 00:00:00 | green |
| 12 |  3 | 01.04.2019 00:00:00 | green |
+----+----+---------------------+-------+

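Now, you may need a dynamic date generator to get the first day of every month between the dates in your results; examples can be found in other Stack questions, such as: Get all dates between two dates in SQL Server

or https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/f648408f-bf91-4f84-8f69-94df8506d4a5/getting-all-months-start-and-end-dates-between-two-dates?forum=transactsql

Both use a recursive CTE and a start/end date to generate the dates in a range. The first produces every date; the second produces only the first and last day of each month. If you use the min/max of your base table as the date range, I imagine the second would be enough.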
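As a rough sketch of that last idea (not taken from either link), a recursive CTE can build the first-of-month list straight from the min/max of the data. The names SampleData, Bounds, StartMonth and EndMonth are made up for illustration, and DATEFROMPARTS assumes SQL Server 2012 or later.

```sql
WITH SampleData (id, col2, modifieddate) AS (
    -- Same sample rows as the cte in the query above
    SELECT 1, 'red',   CAST('20190107' AS date) UNION ALL
    SELECT 1, 'green', CAST('20190207' AS date) UNION ALL
    SELECT 1, 'blue',  CAST('20190307' AS date) UNION ALL
    SELECT 2, 'green', CAST('20190112' AS date) UNION ALL
    SELECT 2, 'blue',  CAST('20190302' AS date) UNION ALL
    SELECT 2, 'red',   CAST('20190419' AS date) UNION ALL
    SELECT 3, 'red',   CAST('20181212' AS date) UNION ALL
    SELECT 3, 'green', CAST('20190210' AS date)
),
Bounds AS (
    -- First day of the earliest and of the latest modified month
    SELECT DATEFROMPARTS(YEAR(MIN(modifieddate)), MONTH(MIN(modifieddate)), 1) AS StartMonth,
           DATEFROMPARTS(YEAR(MAX(modifieddate)), MONTH(MAX(modifieddate)), 1) AS EndMonth
    FROM SampleData
),
Dates AS (
    -- Anchor: the first month in range
    SELECT StartMonth AS FirstofMonth, EndMonth
    FROM Bounds
    UNION ALL
    -- Recursive step: add one month until the last month is reached
    SELECT DATEADD(MONTH, 1, FirstofMonth), EndMonth
    FROM Dates
    WHERE FirstofMonth < EndMonth
)
SELECT FirstofMonth
FROM Dates
ORDER BY FirstofMonth
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels
```

For the sample rows this covers December 2018 through April 2019, since the earliest change is in December 2018. The Dates CTE here could stand in for the hard-coded one in the main query.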

Answer 2

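You can use PARTITION BY to get the latest entry for each id within each month, then join that to a table containing every combination of id and month; that is the m_id table below. Here is a demo.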

WITH 
data AS
(
    SELECT *, 
        DATEADD(d, 1, EOMONTH(modified_date)) AS FirstOfNextMonth,
        RANK() OVER (
            PARTITION BY id, DATEADD(d, 1, EOMONTH(modified_date))
            ORDER BY modified_date DESC
            ) AS rn
    FROM d
),
m_id AS 
(
    SELECT * 
    FROM y, (SELECT DISTINCT id from d) as p
)

SELECT m_id.id, m_id.FOM, latest.col2 
FROM m_id LEFT JOIN
    (
        SELECT * FROM data
        WHERE rn = 1
    ) AS latest
ON m_id.FOM = latest.FirstOfNextMonth AND m_id.id = latest.id

This returns the result below. You can also filter out months that have not been reached yet (demo).

    id  FOM                 col2
1   1   01.01.2019 00:00:00 NULL
2   1   01.02.2019 00:00:00 red
3   1   01.03.2019 00:00:00 green
4   1   01.04.2019 00:00:00 blue
5   1   01.05.2019 00:00:00 NULL
6   1   01.06.2019 00:00:00 NULL
7   1   01.07.2019 00:00:00 NULL
8   1   01.08.2019 00:00:00 NULL
9   1   01.09.2019 00:00:00 NULL
10  1   01.10.2019 00:00:00 NULL
11  1   01.11.2019 00:00:00 NULL
12  1   01.12.2019 00:00:00 NULL
13  2   01.01.2019 00:00:00 NULL
14  2   01.02.2019 00:00:00 green
15  2   01.03.2019 00:00:00 NULL
16  2   01.04.2019 00:00:00 blue
17  2   01.05.2019 00:00:00 red
18  2   01.06.2019 00:00:00 NULL
19  2   01.07.2019 00:00:00 NULL
20  2   01.08.2019 00:00:00 NULL
21  2   01.09.2019 00:00:00 NULL
22  2   01.10.2019 00:00:00 NULL
23  2   01.11.2019 00:00:00 NULL
24  2   01.12.2019 00:00:00 NULL
25  3   01.01.2019 00:00:00 red
26  3   01.02.2019 00:00:00 NULL
27  3   01.03.2019 00:00:00 green
28  3   01.04.2019 00:00:00 NULL
29  3   01.05.2019 00:00:00 NULL
30  3   01.06.2019 00:00:00 NULL
31  3   01.07.2019 00:00:00 NULL
32  3   01.08.2019 00:00:00 NULL
33  3   01.09.2019 00:00:00 NULL
34  3   01.10.2019 00:00:00 NULL
35  3   01.11.2019 00:00:00 NULL
36  3   01.12.2019 00:00:00 NULL
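Since the demo links are not reproduced here, below is a self-contained sketch of the same query with the "months not yet reached" filter added. The shapes of d (the change log) and y (one row per month start, column FOM) are guesses based on how the query uses them, and @AsOf is a hypothetical stand-in for GETDATE() so the result is repeatable.

```sql
-- Hypothetical "as of" date standing in for GETDATE()
DECLARE @AsOf date = '20190415';

WITH d (id, col2, modified_date) AS (
    -- Assumed contents of d: the sample rows from the question
    SELECT 1, 'red',   CAST('20190107' AS date) UNION ALL
    SELECT 1, 'green', CAST('20190207' AS date) UNION ALL
    SELECT 1, 'blue',  CAST('20190307' AS date) UNION ALL
    SELECT 2, 'green', CAST('20190112' AS date) UNION ALL
    SELECT 2, 'blue',  CAST('20190302' AS date) UNION ALL
    SELECT 2, 'red',   CAST('20190419' AS date) UNION ALL
    SELECT 3, 'red',   CAST('20181212' AS date) UNION ALL
    SELECT 3, 'green', CAST('20190210' AS date)
),
y (FOM) AS (
    -- Assumed contents of y: the first day of every month of 2019
    SELECT CAST('20190101' AS date)
    UNION ALL
    SELECT DATEADD(MONTH, 1, FOM) FROM y WHERE FOM < '20191201'
),
data AS (
    -- Rank each id's changes within a month, latest first
    SELECT d.*,
           DATEADD(DAY, 1, EOMONTH(modified_date)) AS FirstOfNextMonth,
           RANK() OVER (
               PARTITION BY id, DATEADD(DAY, 1, EOMONTH(modified_date))
               ORDER BY modified_date DESC
           ) AS rn
    FROM d
),
m_id AS (
    -- Every combination of month start and id
    SELECT y.FOM, p.id
    FROM y
    CROSS JOIN (SELECT DISTINCT id FROM d) AS p
)
SELECT m_id.id, m_id.FOM, latest.col2
FROM m_id
LEFT JOIN (SELECT * FROM data WHERE rn = 1) AS latest
    ON m_id.FOM = latest.FirstOfNextMonth
   AND m_id.id = latest.id
WHERE m_id.FOM <= @AsOf   -- drop months that have not started yet
ORDER BY m_id.id, m_id.FOM;
```

With @AsOf set in April 2019 only the January through April rows remain for each id. The filter sits on m_id, the left side of the join, so it does not turn the LEFT JOIN into an inner join.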

Answer 3

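This works. I used a bunch of subselects, which makes the query a bit long and tedious. It can very likely be simplified a lot, and I haven't tested performance. I'm not sure which version of SQL you're using, but newer versions should have some features that would let you simplify it further. You will have to adapt it.

I also added a "date dimension" table to simplify working with dates. As I said above, I think nearly every database can benefit from a date dimension and a numbers table. There are countless articles on why and how, but I've always been a fan of Aaron Bertrand's articles.

SQL Fiddle (see the Fiddle for setup)

Query: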

SELECT s5.id, s5.d, s5.col2, s5.col4
FROM (
    SELECT s4.id, s4.d, s4.col2, s4.theDay, s4.theYear
      /* 5. Smear the past data up to the next change. */
      , MAX(s4.col2) OVER (PARTITION BY s4.c1, s4.id) AS col4
    FROM (
        SELECT s1.d, s1.theDay, s1.theYear, s1.id , s2.col2
            /* 4. Identify the records that should be grouped in the window. */
            , COUNT(s2.col2) OVER (ORDER BY s1.id, s1.d) AS c1
        FROM ( 
            /* 1. build the list of days for each id */
            SELECT dd.d, dd.theDay, dd.theYear, s1.id
            FROM datedim dd 
            CROSS APPLY ( SELECT DISTINCT t.id FROM t1 t) s1
        ) s1
        /* 3. JOIN the two together. */
        LEFT OUTER JOIN ( 
            /* 2. Remove dupes from modified records */
            SELECT s3.id, s3.col2, s3.modified
            FROM (
                SELECT t1.id, t1.col2, t1.modified, d1.theMonth AS monthModified
                    /* 2a. Use the ROW_NUMBER() Window Function to number changes in a month. */
                    , ROW_NUMBER() OVER (PARTITION BY t1.id, d1.theYear, d1.theMonth ORDER BY t1.modified DESC) AS rn
                FROM t1
                INNER JOIN datedim d1 ON t1.modified = d1.d
            ) s3
            WHERE s3.rn = 1
        ) s2 ON s1.d = s2.modified
            AND s1.id = s2.id
    ) s4
)s5
/* 6. Filter for only the 1st day of the month. */
WHERE s5.theDay = 1
    AND s5.theYear = year(getDate())
    AND s5.d <= getDate()
/* 6a. Also, if we set a color before 1/1, moving the filter for the date and the year will allow us to carry the color forward from the last time it was set. */
ORDER BY s5.id, s5.d

This gives you:

| id |          d |   col2 |   col4 |
|----|------------|--------|--------|
|  1 | 2019-01-01 | (null) | (null) |
|  1 | 2019-02-01 | (null) |    red |
|  1 | 2019-03-01 | (null) |  green |
|  1 | 2019-04-01 | (null) |   blue |
|  1 | 2019-05-01 | (null) |   blue |
|  1 | 2019-06-01 | (null) |   blue |
|  1 | 2019-07-01 | (null) |   blue |
|  1 | 2019-08-01 | (null) |   blue |
|  2 | 2019-01-01 | (null) | (null) |
|  2 | 2019-02-01 | (null) |  green |
|  2 | 2019-03-01 | (null) |  green |
|  2 | 2019-04-01 | (null) |   blue |
|  2 | 2019-05-01 | (null) |    red |
|  2 | 2019-06-01 | (null) |    red |
|  2 | 2019-07-01 | (null) |    red |
|  2 | 2019-08-01 | (null) |    red |
|  3 | 2019-01-01 | (null) | yellow |
|  3 | 2019-02-01 | (null) | yellow |
|  3 | 2019-03-01 | (null) |  green |
|  3 | 2019-04-01 | (null) |  green |
|  3 | 2019-05-01 | (null) |  green |
|  3 | 2019-06-01 | (null) |  green |
|  3 | 2019-07-01 | (null) |  green |
|  3 | 2019-08-01 | (null) |  green |
|  4 | 2019-01-01 | (null) | (null) |
|  4 | 2019-02-01 | (null) | (null) |
|  4 | 2019-03-01 | (null) |  green |
|  4 | 2019-04-01 | (null) |  green |
|  4 | 2019-05-01 | (null) |  green |
|  4 | 2019-06-01 | orange | orange |
|  4 | 2019-07-01 | (null) | orange |
|  4 | 2019-08-01 | (null) | orange |

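I tried to comment the query so you can follow my logic. I also added an extra test case to the changes table to demonstrate how the latest change is picked when 2 or more changes happen in one month. The second change I added checks a color that was set in the previous year. If it shouldn't do that, you can move the year and date checks back into s1.

Essentially, I used the date table to create a running "calendar" that makes it easy to "smear" the change data across the missing dates. Those days are then applied to each id. Then the latest change is selected and the missing colors are filled in. Finally, only the first day of each month is selected for each id.

Note that with a calendar table / date dimension, it would be just as easy to find the color on the third Tuesday of each month.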
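To illustrate the third-Tuesday remark, here is a minimal sketch. It builds a throwaway calendar inline instead of using the datedim table from the Fiddle, and it reuses the question's sample rows under the made-up name Changes; Calendar, ThirdTuesdays and ids are likewise illustrative names only.

```sql
WITH Changes (id, col2, modified) AS (
    -- Sample rows from the question
    SELECT 1, 'red',   CAST('20190107' AS date) UNION ALL
    SELECT 1, 'green', CAST('20190207' AS date) UNION ALL
    SELECT 1, 'blue',  CAST('20190307' AS date) UNION ALL
    SELECT 2, 'green', CAST('20190112' AS date) UNION ALL
    SELECT 2, 'blue',  CAST('20190302' AS date) UNION ALL
    SELECT 2, 'red',   CAST('20190419' AS date) UNION ALL
    SELECT 3, 'red',   CAST('20181212' AS date) UNION ALL
    SELECT 3, 'green', CAST('20190210' AS date)
),
Calendar AS (
    -- One row per day, January through April 2019
    SELECT CAST('20190101' AS date) AS d
    UNION ALL
    SELECT DATEADD(DAY, 1, d) FROM Calendar WHERE d < '20190430'
),
ThirdTuesdays AS (
    SELECT d
    FROM Calendar
    WHERE DAY(d) BETWEEN 15 AND 21               -- the third occurrence of any weekday falls on days 15-21
      AND DATEDIFF(DAY, '19000102', d) % 7 = 0   -- 1900-01-02 was a Tuesday; independent of DATEFIRST and language
)
SELECT ids.id, tt.d AS ThirdTuesday, x.col2
FROM ThirdTuesdays tt
CROSS JOIN (SELECT DISTINCT id FROM Changes) ids
OUTER APPLY (SELECT TOP 1 c.col2
             FROM Changes c
             WHERE c.id = ids.id
               AND c.modified <= tt.d
             ORDER BY c.modified DESC) x
ORDER BY ids.id, tt.d
OPTION (MAXRECURSION 366);  -- the calendar needs more than the default 100 levels
```

A real date dimension would normally store the weekday and the week-of-month as columns, which would reduce the ThirdTuesdays step to a plain WHERE on those columns.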

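Also, if you have a lot of ids and you are looking at this report in December, it could turn into a lot of data. You may have to massage it down to a manageable size.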

Get the latest value of a column on the first day of every month