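I have two tables to which we add roughly 100k and 1.5M new rows per day, respectively. These are log entries, and in more than 99% of cases I am only interested in the last 3 working days when reading them.
If I run a simple query like
SELECT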
0 as Id, ProcessElementName, Null as ModelPath, Status, Remark, ValidFrom, Application, JobID, JobName, CreateDate, CreatedBy, MessageType, Running, Manual, Environment, RunIdentifier, BatchJobGroup, BatchJob, IsTemp, TotalRows = COUNT(*) OVER()
FROM dbo.pclTB_ProcessElementInfo WITH (NOLOCK)
WHERE
ValidFrom > '6/26/2017 12:00:00 AM'
AND ValidFrom <= '6/26/2017 11:59:59 PM'
AND (Environment in ('---')) AND
(
Remark LIKE '%' + 'btve' + '%'
AND Application = '---'
AND (IsTemp = 0 OR IsTemp IS NULL )
AND ProcessElementName = '---'
)
ORDER BY JobID ASC
OFFSET 0 ROWS FETCH NEXT 1000 ROWS ONLY
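it can take up to 10 seconds. Some of the other queries contain joins, but most are about this simple. When I update the statistics manually, the execution time drops to about 2 seconds, but I am sure there is still room for improvement (I am aware of trace flag 2371).
What is the best way to optimize the table (or the query?) for fetching the most recent rows? Would it make sense to create a new table that contains only the entries of the last X days?
Edit: This is the index used for the query above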
CREATE NONCLUSTERED INDEX [IX_ProcessElementNameApplicationEnvironmentValidFrom] ON [dbo].[pclTB_ProcessElementInfo]
(
[ProcessElementName] ASC,
[Application] ASC,
[Environment] ASC,
[ValidFrom] ASC
)
INCLUDE (
[Status],
[Remark],
[JobID],
[JobName],
[CreateDate],
[CreatedBy],
[MessageType],
[Running],
[Manual],
[RunIdentifier],
[BatchJobGroup],
[BatchJob],
[IsTemp]
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
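Answer 0 (score: 2)
[1] Partitioning is one solution, if you have either Enterprise Edition (SQL2005 / 2008 [R2] / 2012 / 2014 / 2016 before SP1) or SQL2016 SP1+, where partitioning is available in every edition.
[2] Another solution is filtered indexes. I would create one filtered index for each of the last 3 days: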
CREATE NONCLUSTERED INDEX IXF_Table_2017_06_28_...
ON dbo.Table (SomeColumn1)
INCLUDE (Column2, Column3, ...)
WHERE Timestamp >= '2017-06-28' AND Timestamp < '2017-06-29'
CREATE NONCLUSTERED INDEX IXF_Table_2017_06_27_...
ON dbo.Table (SomeColumn1)
INCLUDE (Column2, Column3, ...)
WHERE Timestamp >= '2017-06-27' AND Timestamp < '2017-06-28'
CREATE NONCLUSTERED INDEX IXF_Table_2017_06_26_...
ON dbo.Table (SomeColumn1)
INCLUDE (Column2, Column3, ...)
WHERE Timestamp >= '2017-06-26' AND Timestamp < '2017-06-27'
Also, make sure these indexes include all the columns the queries need (see covering indexes).
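As a concrete illustration, the index posted in the question could be turned into a one-day filtered covering index roughly as follows. This is only a sketch: the index name is made up, the key and INCLUDE columns are copied from the question's index, and ValidFrom is assumed to be a datetime column. The date literals use the YYYYMMDD form because it is interpreted the same way regardless of the session language.

```sql
-- Hypothetical index name; columns taken from the index shown in the question.
-- Assumes ValidFrom is a datetime column.
CREATE NONCLUSTERED INDEX IXF_pclTB_ProcessElementInfo_2017_06_26
ON dbo.pclTB_ProcessElementInfo
(
    ProcessElementName,
    Application,
    Environment,
    ValidFrom
)
INCLUDE
(
    Status, Remark, JobID, JobName, CreateDate, CreatedBy, MessageType,
    Running, Manual, RunIdentifier, BatchJobGroup, BatchJob, IsTemp
)
-- Half-open range: covers the whole of 2017-06-26 and nothing else.
WHERE ValidFrom >= '20170626' AND ValidFrom < '20170627';
```

For the optimizer to pick such an index, the query's ValidFrom range has to fall inside the index filter, so a half-open range (`>=` the day, `<` the next day) in the query is the easiest form to match.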
[3] The initial query should then be rewritten:
SELECT Column1
FROM dbo.Table WITH (NOLOCK)
WHERE
-- Timestamp filters (these filters should match with filter predicates / `CREATE INDEX ... ON ... WHERE _`)
(
(Timestamp >= '2017-06-27' AND Timestamp < '2017-06-28')
OR (Timestamp >= '2017-06-26' AND Timestamp < '2017-06-27')
)
-- End of Timestamp filters
AND some_conditions
ORDER BY Column1 ASC
OFFSET 0 ROWS FETCH NEXT 1000 ROWS ONLY
[4] All of the SQL statements above should be run with the following SET options:
SET ANSI_NULLS, QUOTED_IDENTIFIER ON
SET ANSI_PADDING, ANSI_WARNINGS, ARITHABORT, CONCAT_NULL_YIELDS_NULL ON
SET NUMERIC_ROUNDABORT OFF
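Filtered indexes require these settings both when the index is created and when the table is later modified; a session with different settings typically fails on INSERT/UPDATE/DELETE with error 1934. A quick way to see what the current session uses is `SESSIONPROPERTY`, as in this small sketch (the column aliases are arbitrary):

```sql
-- Each value is 1 (ON) or 0 (OFF) for the current session.
-- For filtered indexes, every option should be 1 except NUMERIC_ROUNDABORT, which should be 0.
SELECT
    SESSIONPROPERTY('ANSI_NULLS')              AS ansi_nulls,
    SESSIONPROPERTY('QUOTED_IDENTIFIER')       AS quoted_identifier,
    SESSIONPROPERTY('ANSI_PADDING')            AS ansi_padding,
    SESSIONPROPERTY('ANSI_WARNINGS')           AS ansi_warnings,
    SESSIONPROPERTY('ARITHABORT')              AS arithabort,
    SESSIONPROPERTY('CONCAT_NULL_YIELDS_NULL') AS concat_null_yields_null,
    SESSIONPROPERTY('NUMERIC_ROUNDABORT')      AS numeric_roundabort;
```

It is worth running this from the application's own connection as well, since client libraries and SQL Agent jobs do not necessarily use the same defaults as Management Studio.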
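Answer 1 (score: 1)
You could consider table partitioning. Suppose you create partitions for the last 3 days and one partition for the rest of the data. You would then update the queries so that they touch only the recent partitions. Partitioning has some limitations, for example you can only partition on data that is part of the clustered index, but it may be worth it. You do not have to use different filegroups as mentioned in the link above. Here is another link you might find useful: How to Implement an Automatic Sliding Window in a Partitioned Table on SQL Server 2005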
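A minimal sketch of what such a setup might look like for the table in the question. The partition function, scheme, and index names are hypothetical, ValidFrom is assumed to be a datetime column, and everything is placed on [PRIMARY] for simplicity. The last statement assumes the table has no clustered index yet; if it already has one, it would have to be rebuilt on the scheme instead (for example with DROP_EXISTING = ON and the existing index name).

```sql
-- Hypothetical names: pf_ValidFromDaily, ps_ValidFromDaily, CIX_pclTB_ProcessElementInfo_ValidFrom.
-- RANGE RIGHT with three boundaries gives four partitions:
--   older data | 2017-06-26 | 2017-06-27 | 2017-06-28 and later
CREATE PARTITION FUNCTION pf_ValidFromDaily (datetime)
AS RANGE RIGHT FOR VALUES ('20170626', '20170627', '20170628');

-- Map every partition to the same filegroup.
CREATE PARTITION SCHEME ps_ValidFromDaily
AS PARTITION pf_ValidFromDaily ALL TO ([PRIMARY]);

-- Daily sliding-window step (would be run by a scheduled job):
-- add a boundary for the next day, then fold the oldest day into the "older data" partition.
ALTER PARTITION SCHEME ps_ValidFromDaily NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_ValidFromDaily() SPLIT RANGE ('20170629');
ALTER PARTITION FUNCTION pf_ValidFromDaily() MERGE RANGE ('20170626');

-- Place the table on the scheme by (re)building its clustered index on ValidFrom.
-- Assumption: no clustered index exists yet.
CREATE CLUSTERED INDEX CIX_pclTB_ProcessElementInfo_ValidFrom
ON dbo.pclTB_ProcessElementInfo (ValidFrom)
ON ps_ValidFromDaily (ValidFrom);
```

With the table partitioned this way, a plain range predicate on ValidFrom is normally enough for partition elimination, so the queries do not have to name a partition explicitly. Note that merging a boundary between two non-empty partitions moves data, which can be expensive on a table of this size.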
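Answer 2 (score: 1)
When you insert data into the table, also insert it into a second table that holds only the records of the last X days. You can then use a stored procedure to delete records from that table automatically after a certain amount of time. See: How to automatically delete records in sql server after a certain amount of time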
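The cleanup part could look roughly like the sketch below. The table name dbo.pclTB_ProcessElementInfo_Recent and the procedure name are hypothetical, and the table is assumed to have the same ValidFrom column as the original; the procedure would be called from a scheduled job (for example SQL Server Agent) once a day.

```sql
-- Hypothetical procedure and table names; assumes SQL Server 2008 or later (date type).
CREATE PROCEDURE dbo.usp_PurgeRecentProcessElementInfo
    @KeepDays int = 3   -- number of days to keep in the "recent" table
AS
BEGIN
    SET NOCOUNT ON;

    -- Midnight of the day that lies @KeepDays days before today.
    DECLARE @cutoff datetime = DATEADD(DAY, -@KeepDays, CAST(GETDATE() AS date));

    -- Remove everything older than the cutoff.
    DELETE FROM dbo.pclTB_ProcessElementInfo_Recent
    WHERE ValidFrom < @cutoff;
END
```

Because the question speaks of working days rather than calendar days, @KeepDays would probably need to be larger than 3 to cover weekends.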
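Answer 3 (score: 1)
You could have a daily job that recreates a filtered index. You can copy the existing index and add a filter on the date for the last three days: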
DECLARE @sql varchar(8000) = '
IF EXISTS (SELECT 1 FROM sys.indexes WHERE name = ''IX_IndexName'')
DROP INDEX IX_IndexName ON My_table;
CREATE NONCLUSTERED INDEX IX_IndexName ON My_table (
timestamp
)
WHERE timestamp > ''' + CONVERT(varchar(25),DATEADD(d,-3,GETDATE()) ,121) + ''';';
EXEC (@sql);
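One thing to watch with a rolling filtered index like this: as far as I know, the optimizer only uses a filtered index when it can prove at compile time that the query's predicate falls inside the index filter. With a variable or parameter it usually cannot, unless the statement is recompiled with the actual value. A sketch, using the placeholder names My_table and timestamp from the script above:

```sql
-- My_table and the timestamp column are the placeholder names used in the script above.
DECLARE @from datetime = DATEADD(DAY, -1, GETDATE());

SELECT [timestamp]
FROM My_table
WHERE [timestamp] > @from
ORDER BY [timestamp]
-- RECOMPILE lets the optimizer see the actual value of @from,
-- so it can check that the range lies inside the index filter.
OPTION (RECOMPILE);
```

The alternative is to build the query with a literal date, the same way the script builds the index definition.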