First off, I am not a database programmer.
I have built the following table for stock market tick data:
CREATE TABLE [dbo].[Tick]
(
[trade_date] [int] NOT NULL,
[delimiter] [tinyint] NOT NULL,
[time_stamp] [int] NOT NULL,
[exchange] [tinyint] NOT NULL,
[symbol] [varchar](10) NOT NULL,
[price_field] [tinyint] NOT NULL,
[price] [int] NOT NULL,
[size_field] [tinyint] NOT NULL,
[size] [int] NOT NULL,
[exchange2] [tinyint] NOT NULL,
[trade_condition] [tinyint] NOT NULL
) ON [PRIMARY]
GO
The table will store 6 years of data to begin with. At an average of 300 million ticks per day that would be about 450 billion rows.
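A rough check of that estimate, assuming about 252 trading days per year (an assumed figure, not one taken from the data):

```sql
-- 300 million ticks/day * 252 trading days/year (assumed) * 6 years.
-- CAST to bigint first, since the product does not fit in an int.
SELECT CAST(300000000 AS bigint) * 252 * 6 AS estimated_rows;
-- 453600000000, i.e. roughly 450 billion
```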
A common query on this table gets all the ticks for one or more symbols over a date range:
SELECT
trade_date, time_stamp, symbol, price, size
FROM
dbo.Tick
WHERE
trade_date > 20160101 and trade_date < 20170101
AND symbol = 'AAPL'
AND price_field = 0
ORDER BY
trade_date, time_stamp
This is my first attempt at an index:
CREATE UNIQUE CLUSTERED INDEX [ClusteredIndex-20180324-183113]
ON [dbo].[Tick]
(
[trade_date] ASC,
[symbol] ASC,
[time_stamp] ASC,
[price_field] ASC,
[delimiter] ASC,
[exchange] ASC,
[price] ASC,
[size_field] ASC,
[size] ASC,
[exchange2] ASC,
[trade_condition] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
First, I put date before symbol because there are fewer days than symbols, so I assumed the shorter path is to reach the date first.
I have included all the columns I would potentially need to retrieve. When I tested building it on one day's worth of data, the index was quite large: about 4 GB for a 20 GB table.
Two questions:
Is leaving out a primary key to save space a wise choice, assuming my query requirements don't change?
Would I save space if I only included trade_date and symbol in the index? How would that affect performance? I've been told I need to include every column I need in the index, because otherwise retrieval would be very slow: the engine would have to go back to the primary key to find the values of the columns not in the index. If this is true, how would that even work when my table doesn't have a primary key?
Answer 0 (score: 2)
Your unique clustered index should contain the minimum number of columns necessary to uniquely identify a row in your table. If that means almost every column in the table, I think you should add an artificial primary key. Cutting an artificial primary key to save space is a poor decision IMO; only cut it if you can build a natural primary key out of your data.
The clustered index is essentially where all your data is stored. The leaf nodes of the index contain all the data for each row; the columns that make up the index key determine how those leaf nodes are reached.
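One possible reading of that advice, as a minimal sketch rather than a definitive design: add a surrogate key column and cluster on a narrow key made unique by it. The column name tick_id and the index name CIX_Tick are hypothetical, and the sketch assumes the original wide clustered index has already been dropped.

```sql
-- Hypothetical surrogate key column; bigint because roughly 450 billion rows
-- would overflow int (maximum 2,147,483,647).
ALTER TABLE [dbo].[Tick]
    ADD [tick_id] bigint IDENTITY(1,1) NOT NULL;
GO

-- Assumes the original wide clustered index has been dropped first.
-- The narrow key orders rows by the query columns; tick_id makes it unique.
CREATE UNIQUE CLUSTERED INDEX [CIX_Tick]
ON [dbo].[Tick]
(
    [trade_date] ASC,
    [symbol] ASC,
    [time_stamp] ASC,
    [tick_id] ASC
);
GO
```

Because the leaf level of a clustered index already holds every column of the row, the remaining columns do not need to appear in the key to be retrievable.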
Including extra columns in an index to speed up queries only applies to NONCLUSTERED indexes, because there the leaf node generally contains only the key columns and a lookup value. For these indexes, the way to add extra columns is the INCLUDE clause, not listing them all as part of the index key. For example:
CREATE NONCLUSTERED INDEX [IX_TickSummary] ON [dbo].[Tick]
(
[trade_date] ASC,
[symbol] ASC
)
INCLUDE (
[time_stamp],
[price],
[size],
[price_field]
)
This is known as creating a covering index: the index itself contains all the columns needed to process your query, so no additional lookup into the data table is needed. The upside is increased speed. The downside is that the INCLUDEd columns are essentially duplicated, which makes the index larger and consumes more space.
Include columns that are used very frequently, such as those used to generate summary listings. Columns that are queried infrequently, such as those only needed in detailed views, should be left out of the index to save space.
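As an illustration, the query from the question references only trade_date, symbol, price_field, time_stamp, price and size, all of which are present in IX_TickSummary, so it should be answerable from that index alone. This is a sketch that assumes the index above has been created; which plan the optimizer actually picks is not guaranteed.

```sql
-- Every column referenced here is either a key column or an INCLUDE column
-- of IX_TickSummary, so no lookup into the base table should be required.
SELECT
    trade_date, time_stamp, symbol, price, size
FROM
    dbo.Tick
WHERE
    trade_date > 20160101 AND trade_date < 20170101
    AND symbol = 'AAPL'
    AND price_field = 0
ORDER BY
    trade_date, time_stamp;
```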
Potentially helpful reading: Using Covering Indexes to Improve Query Performance
Answer 1 (score: 1)
Looking at your most common query, you should create a composite index that starts with the columns used in the WHERE clause:
trade_date, symbol, price_field
followed by the columns in the SELECT list:
time_stamp, symbol, price, size
That way the index serves both the filtering and the column retrieval, so the data table does not need to be accessed (symbol is listed only once, since it is already among the WHERE columns):
trade_date, symbol, price_field, time_stamp, price, size
In your key order, time_stamp comes before price_field, that is, a SELECT column before a WHERE column, which prevents the database engine from using the index to its full extent.
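The column order described above could be written as follows. This is a sketch only: the index name IX_Tick_Query is hypothetical, and making it a NONCLUSTERED index is an assumption, since the answer does not say which kind is meant.

```sql
-- WHERE columns first (trade_date, symbol, price_field),
-- then the remaining SELECT columns (time_stamp, price, size).
CREATE NONCLUSTERED INDEX [IX_Tick_Query]
ON [dbo].[Tick]
(
    [trade_date] ASC,
    [symbol] ASC,
    [price_field] ASC,
    [time_stamp] ASC,
    [price] ASC,
    [size] ASC
);
```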