Consider the following schema:
-- items which have periodic updates
CREATE TABLE items (
[id] int identity(1, 1) primary key,
[name] varchar(100) not null
);
-- item updates. updating an item generally means it has a new status, at a certain time.
CREATE TABLE updates (
[id] int identity(1, 1) primary key,
[item_id] int foreign key references items([id]),
[new_status] varchar(100) not null,
[update_date] datetime not null
);
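This schema tracks the status of an item as it moves through many states over time.
I have been trying to find an efficient query to answer the following question:
Given many items, each of which can be in one of several states, and given that we record every status update, how many items are in each state at the end of each day?
I have a SQLFiddle here with some sample data and my current attempt at this query. It runs fine on a handful of items, but my database has hundreds of thousands of them, so the query currently takes about 5 minutes to run.
Is there a more efficient query to answer this question?
Test data: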
-- let's just say that we just created 3 new items
INSERT INTO items (name)
VALUES ('item1'), ('item2'), ('item3');
-- and they all start in the new state
INSERT INTO updates (item_id, new_status, update_date)
SELECT
[id],
[new_status] = 'new',
[update_date] = '2017-10-9 00:00:00.000'
FROM items;
-- then we have them update over the course of a couple days
-- item 1
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-10 00:00:00.000'
FROM items WHERE [name] = 'item1'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-12 00:00:00.000'
FROM items WHERE [name] = 'item1'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-14 00:00:00.000'
FROM items WHERE [name] = 'item1';
-- item 2
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-10 00:00:00.000'
FROM items WHERE [name] = 'item2'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-11 00:00:00.000'
FROM items WHERE [name] = 'item2'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-12 00:00:00.000'
FROM items WHERE [name] = 'item2';
-- item 3
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-11 00:00:00.000'
FROM items WHERE [name] = 'item3'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-13 00:00:00.000'
FROM items WHERE [name] = 'item3'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-15 00:00:00.000'
FROM items WHERE [name] = 'item3';
Current query:
-- =======================
-- Running latest record
-- =======================
-- Goal: For a period of time, with multiple items, which have multiple updates,
-- find the number of items which are in each state at the end of a day.
--
-- Issue: How can I improve this query for a large database?
--
SELECT
dates.[update_date],
state = latest_update.[new_status],
volume = COUNT(*)
FROM items i -- start with the items that we want to count per day
CROSS JOIN (
SELECT DISTINCT [update_date] FROM updates
) dates -- the days to count for
CROSS APPLY (
-- this cross apply gets all updates for an item, that occurred on or before each date
SELECT
updates.*,
RN = ROW_NUMBER() OVER (PARTITION BY [item_id] ORDER BY [update_date] DESC)
FROM updates
WHERE [update_date] <= dates.[update_date] AND [item_id] = i.[id]
) latest_update
WHERE latest_update.RN = 1 -- only count the latest update
GROUP BY dates.[update_date], latest_update.[new_status]
ORDER BY dates.[update_date], latest_update.[new_status]
Results:
| update_date | state | volume |
|----------------------|-------------|--------|
| 2017-10-09T00:00:00Z | new | 3 |
| 2017-10-10T00:00:00Z | in progress | 2 |
| 2017-10-10T00:00:00Z | new | 1 |
| 2017-10-11T00:00:00Z | in progress | 2 |
| 2017-10-11T00:00:00Z | ready | 1 |
| 2017-10-12T00:00:00Z | complete | 1 |
| 2017-10-12T00:00:00Z | in progress | 1 |
| 2017-10-12T00:00:00Z | ready | 1 |
| 2017-10-13T00:00:00Z | complete | 1 |
| 2017-10-13T00:00:00Z | ready | 2 |
| 2017-10-14T00:00:00Z | complete | 2 |
| 2017-10-14T00:00:00Z | ready | 1 |
| 2017-10-15T00:00:00Z | complete | 3 |
Answer 0 (score: 3)
One approach is to use conditional aggregation.
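As a minimal sketch of how conditional aggregation could apply to this question (not necessarily the query this answer had in mind), the following assumes SQL Server 2012 or later for LEAD, and assumes the four status values seen in the sample data. The names spans, dates, and next_date are introduced here for illustration only.

```sql
-- Sketch only: spans, dates and next_date are hypothetical names,
-- and the status list is assumed from the sample data.
WITH spans AS (
    -- Each update is in effect from its own date until the item's next update.
    SELECT
        [item_id],
        [new_status],
        [update_date],
        [next_date] = LEAD([update_date]) OVER (PARTITION BY [item_id] ORDER BY [update_date])
    FROM updates
),
dates AS (
    -- The days to count for, same as in the question's query.
    SELECT DISTINCT [update_date] FROM updates
)
SELECT
    d.[update_date],
    -- Conditional aggregation: one SUM(CASE ...) per status.
    [new]         = SUM(CASE WHEN s.[new_status] = 'new'         THEN 1 ELSE 0 END),
    [in progress] = SUM(CASE WHEN s.[new_status] = 'in progress' THEN 1 ELSE 0 END),
    [ready]       = SUM(CASE WHEN s.[new_status] = 'ready'       THEN 1 ELSE 0 END),
    [complete]    = SUM(CASE WHEN s.[new_status] = 'complete'    THEN 1 ELSE 0 END)
FROM dates d
JOIN spans s
    ON  s.[update_date] <= d.[update_date]
    AND (s.[next_date] IS NULL OR s.[next_date] > d.[update_date])
GROUP BY d.[update_date]
ORDER BY d.[update_date];
```

This returns one row per date with one column per status, rather than one row per date and status as in the question's result. It avoids re-ranking each item's updates for every date, which may help on a large table, but that is worth confirming against the actual execution plan.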
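Answer 1 (score: 0)
The GROUP BY clause at the end of the statement below groups the rows by the value in the new_status column. The database then returns one row for each distinct value of new_status.
select new_status,count(new_status) from updates group by new_status
In other words, if we ran the query without the count(new_status) part, it would return the same rows as:
select distinct new_status from updates
Because we asked for a count, the database counts how many rows it grouped together for each distinct value and shows that number in the count(new_status) column. The database does not give this computed column a name by default, but you can assign one:
select new_status,count(new_status) as nmbr_items from updates group by new_status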