How to efficiently find the running latest update of multiple records in SQL?

Time: 2017-10-10 00:00:56

Tags: sql sql-server database optimization

Consider the following schema:

-- items which have periodic updates
CREATE TABLE items (
  [id] int identity(1, 1) primary key,
  [name] varchar(100) not null
);

-- item updates. updating an item generally means it has a new status, at a certain time.
CREATE TABLE updates (
  [id] int identity(1, 1) primary key,
  [item_id] int foreign key references items([id]),
  [new_status] varchar(100) not null,
  [update_date] datetime not null
);

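It is used to track an item's status as it moves through many states over time.

I have been trying to find an efficient query to answer the following question:

For many items, each of which can be in one of several states and whose status updates we record, how many items are in each state at the end of each day?

I have a SQLFiddle here with some sample data and my current attempt at this query. It works fine with a few items, but my database has hundreds of thousands, so the query currently takes about 5 minutes to run.

Is there a more efficient query to answer this question?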

Test data:

-- items which have periodic updates
CREATE TABLE items (
  [id] int identity(1, 1) primary key,
  [name] varchar(100) not null
);

-- item updates. updating an item generally means it has a new status, at a certain time.
CREATE TABLE updates (
  [id] int identity(1, 1) primary key,
  [item_id] int foreign key references items([id]),
  [new_status] varchar(100) not null,
  [update_date] datetime not null
);

-- let's just say that we just created 3 new items
INSERT INTO items (name)
  VALUES ('item1'), ('item2'), ('item3');

-- and they all start in the new state
INSERT INTO updates (item_id, new_status, update_date)
SELECT
  [id],
  [new_status] = 'new',
  [update_date] = '2017-10-9 00:00:00.000'
FROM items

-- then we have them update over the course of a couple days
-- item 1
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-10 00:00:00.000'
FROM items WHERE [name] = 'item1'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-12 00:00:00.000'
FROM items WHERE [name] = 'item1'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-14 00:00:00.000'
FROM items WHERE [name] = 'item1';

-- item 2
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-10 00:00:00.000'
FROM items WHERE [name] = 'item2'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-11 00:00:00.000'
FROM items WHERE [name] = 'item2'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-12 00:00:00.000'
FROM items WHERE [name] = 'item2';

-- item 3
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-11 00:00:00.000'
FROM items WHERE [name] = 'item3'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-13 00:00:00.000'
FROM items WHERE [name] = 'item3'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-15 00:00:00.000'
FROM items WHERE [name] = 'item3';

Current query:

-- =======================
--  Running latest record
-- =======================
-- Goal: For a period of time, with multiple items, which have multiple updates,
--       find the number of items which are in each state at the end of a day.
-- 
-- Issue: How can I improve this query for a large database?
-- 

SELECT
  dates.[update_date],
  state = latest_update.[new_status],
  volume = COUNT(*)
FROM items i -- start with the items that we want to count per day
CROSS JOIN (
  SELECT DISTINCT [update_date] FROM updates
) dates -- the days to count for
CROSS APPLY (
  -- this cross apply gets all updates for an item, that occurred on or before each date
  SELECT
    updates.*,
    RN = ROW_NUMBER() OVER (PARTITION BY [item_id] ORDER BY [update_date] DESC)
  FROM updates
  WHERE [update_date] <= dates.[update_date] AND [item_id] = i.[id]
) latest_update
WHERE latest_update.RN = 1 -- only count the latest update
GROUP BY dates.[update_date], latest_update.[new_status]
ORDER BY dates.[update_date], latest_update.[new_status]

[Results]

|          update_date |       state | volume |
|----------------------|-------------|--------|
| 2017-10-09T00:00:00Z |         new |      3 |
| 2017-10-10T00:00:00Z | in progress |      2 |
| 2017-10-10T00:00:00Z |         new |      1 |
| 2017-10-11T00:00:00Z | in progress |      2 |
| 2017-10-11T00:00:00Z |       ready |      1 |
| 2017-10-12T00:00:00Z |    complete |      1 |
| 2017-10-12T00:00:00Z | in progress |      1 |
| 2017-10-12T00:00:00Z |       ready |      1 |
| 2017-10-13T00:00:00Z |    complete |      1 |
| 2017-10-13T00:00:00Z |       ready |      2 |
| 2017-10-14T00:00:00Z |    complete |      2 |
| 2017-10-14T00:00:00Z |       ready |      1 |
| 2017-10-15T00:00:00Z |    complete |      3 |

2 answers:

Answer 0 (score: 3)

One approach is to use conditional aggregation:

[The code for this answer was not preserved in the source.]
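As a rough illustration of the idea (not the answerer's own query), conditional aggregation could be combined with a running total: each update adds one item to its new status and removes one from the status the item is leaving, and the cumulative sum of those daily changes gives the count per state at each date. The sketch below assumes SQL Server 2012 or later (for `LAG` and windowed `SUM`) and assumes the four status values in the test data are the only ones. The CTE names `transitions` and `daily_change` and the `*_delta` / `*_count` column names are made up for this example.

```sql
WITH transitions AS (
    -- One row per update, paired with the status the item is leaving
    -- (NULL for an item's first update). [id] breaks ties on equal dates.
    SELECT
        [update_date],
        [new_status],
        [prev_status] = LAG([new_status]) OVER (
            PARTITION BY [item_id] ORDER BY [update_date], [id])
    FROM updates
),
daily_change AS (
    -- Conditional aggregation: net change per status on each date
    -- (+1 for each item entering the status, -1 for each item leaving it).
    SELECT
        [update_date],
        new_delta         = SUM(CASE WHEN [new_status]  = 'new'         THEN 1 ELSE 0 END
                              - CASE WHEN [prev_status] = 'new'         THEN 1 ELSE 0 END),
        in_progress_delta = SUM(CASE WHEN [new_status]  = 'in progress' THEN 1 ELSE 0 END
                              - CASE WHEN [prev_status] = 'in progress' THEN 1 ELSE 0 END),
        ready_delta       = SUM(CASE WHEN [new_status]  = 'ready'       THEN 1 ELSE 0 END
                              - CASE WHEN [prev_status] = 'ready'       THEN 1 ELSE 0 END),
        complete_delta    = SUM(CASE WHEN [new_status]  = 'complete'    THEN 1 ELSE 0 END
                              - CASE WHEN [prev_status] = 'complete'    THEN 1 ELSE 0 END)
    FROM transitions
    GROUP BY [update_date]
)
-- Running totals of the daily changes give the items in each state per date.
SELECT
    [update_date],
    new_count         = SUM(new_delta)         OVER (ORDER BY [update_date] ROWS UNBOUNDED PRECEDING),
    in_progress_count = SUM(in_progress_delta) OVER (ORDER BY [update_date] ROWS UNBOUNDED PRECEDING),
    ready_count       = SUM(ready_delta)       OVER (ORDER BY [update_date] ROWS UNBOUNDED PRECEDING),
    complete_count    = SUM(complete_delta)    OVER (ORDER BY [update_date] ROWS UNBOUNDED PRECEDING)
FROM daily_change
ORDER BY [update_date];
```

Unlike the query in the question, this returns one row per date with one column per state (for the sample data, 2017-10-11 shows 0 new, 2 in progress, 1 ready and 0 complete). It reads the updates table once rather than once per item and date, which is where a speed-up would be expected on a large table. An index on (item_id, update_date) may help the `LAG` step, though that is untested here.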

Answer 1 (score: 0)

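The GROUP BY clause at the end of the statement below groups the rows by the value in the new_status column. The database then returns one row for each distinct value of new_status.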

select new_status,count(new_status) from updates group by new_status

In other words, if we ran the query without the count(new_status) part, it would return the same list of values as:

select distinct new_status from updates

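Because we asked for a count, the database counts how many rows fall into each group and shows that number in the count(new_status) column. The database does not give this computed column a name, but you can name it yourself: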

select new_status,count(new_status) as nmbr_items from updates group by new_status
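Note that the query above counts update rows, not items: with the question's test data every status appears in 3 updates, so each status gets a count of 3. If the goal is the number of items whose most recent update has each status, one possible sketch is to number each item's updates from newest to oldest and count only the newest. It assumes SQL Server 2005 or later for `ROW_NUMBER`. The CTE name `ranked` and the column `rn` are illustrative.

```sql
WITH ranked AS (
    -- Number each item's updates from newest (rn = 1) to oldest;
    -- [id] breaks ties when two updates share the same date.
    SELECT
        [item_id],
        [new_status],
        rn = ROW_NUMBER() OVER (
            PARTITION BY [item_id] ORDER BY [update_date] DESC, [id] DESC)
    FROM updates
)
-- Keep only each item's latest update, then count items per status.
SELECT [new_status], nmbr_items = COUNT(*)
FROM ranked
WHERE rn = 1
GROUP BY [new_status];
```

This gives the current snapshot only (for the test data, a single row with 3 items in 'complete'). It does not produce the per-day history the question asks for.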