Is there a faster way to select multiple rows by ID than WHERE IN?

Date: 2013-07-03 00:17:33

Tags: sql sql-server-2008-r2

I have a SQL table, and I want to select multiple rows by ID. For example, I want to get the rows with IDs 1, 5 and 9 from my table.

I have been doing this with a WHERE IN statement similar to the one below:

SELECT [Id]
FROM [MyTable]
WHERE [Id] IN (1,5,9)

However, this becomes very slow when there are a large number of items in the IN clause.

Here is some performance data for selecting rows with WHERE IN from a table with 1,000,000 rows:

Querying for 1 random keys (where in) took 0ms
Querying for 1000 random keys (where in) took 46ms
Querying for 2000 random keys (where in) took 94ms
Querying for 3000 random keys (where in) took 249ms
Querying for 4000 random keys (where in) took 316ms
Querying for 5000 random keys (where in) took 391ms
Querying for 6000 random keys (where in) took 466ms
Querying for 7000 random keys (where in) took 552ms
Querying for 8000 random keys (where in) took 644ms
Querying for 9000 random keys (where in) took 743ms
Querying for 10000 random keys (where in) took 853ms

Is there a faster way to do this than using WHERE IN?

We cannot do a join, because this is between disconnected systems.

I have heard that an in-memory temp table joined to the data may be faster in MySQL, but from my research SQL Server does not have an in-memory table option, and even if it did, wouldn't inserting into the temp table require exactly the same index scan as WHERE IN does?

EDIT

The ID on this table is the PK, so it has the default PK index, cf.

CREATE TABLE [dbo].[Entities](
    [Id] [int] IDENTITY(1,1) NOT NULL,
 CONSTRAINT [PK_dbo.Entities] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Execution plan

(execution plan screenshot omitted)

Here is a gist of the console app that generates these performance results: https://gist.github.com/lukemcgregor/5914774

编辑2 我创建了一个函数,它从逗号分隔的字符串创建临时表,然后与该表连接。它更快,但我认为主要是因为解析查询的问题

Querying for 1 random keys took 1ms
Querying for 1000 random keys took 34ms
Querying for 2000 random keys took 69ms
Querying for 3000 random keys took 111ms
Querying for 4000 random keys took 143ms
Querying for 5000 random keys took 182ms
Querying for 6000 random keys took 224ms
Querying for 7000 random keys took 271ms
Querying for 8000 random keys took 315ms
Querying for 9000 random keys took 361ms
Querying for 10000 random keys took 411ms
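Conceptually, the split-and-join approach from Edit 2 does two things: parse the comma-separated string into rows, then join those rows to the table on the primary key. A minimal Python sketch of that idea (illustrative only; the real version is a T-SQL split function feeding a temp table, and `my_table` here is a toy stand-in for [MyTable]):

```python
def split_ids(csv: str) -> list[int]:
    # Parse a comma-separated ID string into integers, the job the
    # T-SQL split function does before filling the temp table.
    return [int(part) for part in csv.split(",") if part.strip()]

def select_by_ids(table: dict[int, dict], csv: str) -> list[dict]:
    # Join the parsed IDs against the table, keyed by primary key.
    return [table[i] for i in split_ids(csv) if i in table]

# A tiny stand-in for [MyTable], keyed by Id.
my_table = {i: {"Id": i} for i in range(1, 11)}
rows = select_by_ids(my_table, "1,5,9")  # rows with Id 1, 5 and 9
```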

3 Answers:

Answer 0 (score: 9):

OK, so I got it working by defining a table type, passing that type directly into the query as a parameter, and joining to it.

In SQL:

CREATE TYPE [dbo].[IntTable] AS TABLE(
    [value] [int] NULL
)

Code:

// "con" is an open SqlConnection; "toSelect" contains the ids to fetch.
DataTable dataTable = new DataTable("mythang");
dataTable.Columns.Add("value", typeof(Int32));

toSelect.ToList().ForEach(selectItem => dataTable.Rows.Add(selectItem));

var results = new List<int>();

using (SqlCommand command = new SqlCommand(
    @"SELECT * 
    FROM [dbo].[Entities] e 
    INNER JOIN @ids ON e.id = value", con))
{
    var parameter = command.Parameters.AddWithValue("@ids", dataTable);
    parameter.SqlDbType = System.Data.SqlDbType.Structured;
    parameter.TypeName = "IntTable";

    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            results.Add(reader.GetInt32(0));
        }
    }
}

This produces the following results:

Querying for 1 random keys (passed in table value) took 2ms
Querying for 1000 random keys (passed in table value) took 3ms
Querying for 2000 random keys (passed in table value) took 4ms
Querying for 3000 random keys (passed in table value) took 6ms
Querying for 4000 random keys (passed in table value) took 8ms
Querying for 5000 random keys (passed in table value) took 9ms
Querying for 6000 random keys (passed in table value) took 11ms
Querying for 7000 random keys (passed in table value) took 13ms
Querying for 8000 random keys (passed in table value) took 17ms
Querying for 9000 random keys (passed in table value) took 16ms
Querying for 10000 random keys (passed in table value) took 18ms

Answer 1 (score: 3):

I guess if you joined with an in-memory table indexed by a primary key, such as:

declare @tbl table (ids int primary key)

you could fill this table with the IDs you need and perform an optimized inner join.

The problem could be the time needed to fill it. I guess you could either have a linked server, or use the BCP utility to populate a temporary table and then delete it.

Answer 2 (score: 2):

First, I think it is a stretch to claim that your data suggests O(n log(n)). (By the way, it is great that you did performance tests.) Here is the time per value:

keys     ms/key
1000     0.046
2000     0.047
3000     0.083
4000     0.079
5000     0.078
6000     0.078
7000     0.079
8000     0.081
9000     0.083
10000    0.085

Although there is a slight increase as n grows, the jump from 2,000 to 3,000 is much more prominent. If this is repeatable, the question becomes why there is such a discontinuity.

To me, this is more suggestive of O(n) than of O(n log(n)). However, empirical estimates of theoretical values are hard to pin down, so the exact bound is not so important.

I would expect the performance to be O(n) (where n is the number of values, not the bit length used in some estimates). My understanding is that IN behaves like a giant set of ORs: most records fail the test, so they have to go through all the comparisons. Hence O(n).
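The "giant set of ORs" intuition can be sketched in Python (purely illustrative; a real engine compiles IN differently, but the comparison count is the point): with no index, every row is tested against the whole value list, so non-matching rows pay for every comparison.

```python
def scan_where_in(rows, wanted):
    # Unindexed WHERE IN: each row is tested against the value list,
    # like a long chain of ORs -> about len(rows) * len(wanted) comparisons.
    comparisons = 0
    matches = []
    for row in rows:
        for value in wanted:
            comparisons += 1
            if row == value:
                matches.append(row)
                break
    return matches, comparisons

rows = list(range(100))
matches, comparisons = scan_where_in(rows, [1, 5, 9])
# The 97 non-matching rows fail all three tests, so comparisons is near 300.
```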

The next question is whether you have an index on the id field. In that case, you can get the set of matching ids in O(n log(n)) time (log(n) for traversing the index, times n for doing it for each value). This seems worse, but we have left out the factor for the size of the original table. This should be a huge win.

As Andre suggests, you can load a table and do a join to a temporary table. I would leave out the index, because you are probably better off using the index on the larger table. This should get you O(n log(n)), with no (significant) dependence on the size of the original table. Alternatively, you can leave out the index and get O(n * m), where m is the size of the original table. I think any index built on the temp table gets you back to O(n log(n)) performance (assuming the data is not presorted).
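The benefit of probing the larger table's index can be modeled the same way; in this sketch a Python dict stands in for the index (a hash rather than a B-tree, so probes are O(1) instead of O(log m), but the shape of the argument is the same): each of the n ids costs one probe, independent of how many rows fail to match.

```python
def indexed_join(index, wanted):
    # Join n ids against an index on the big table: one probe per id,
    # regardless of the table's row count.
    probes = 0
    matches = []
    for value in wanted:
        probes += 1
        if value in index:
            matches.append(index[value])
    return matches, probes

# A dict stands in for the PK index on a million-row table.
big_table_index = {i: ("row", i) for i in range(1_000_000)}
matches, probes = indexed_join(big_table_index, [1, 5, 9, 2_000_000])
# probes == 4 despite the million rows behind the index.
```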

Putting everything in the query has a similar, unstated problem: parsing the query. This takes longer as the string gets longer.

In short, I commend you for doing performance measurements, but not for drawing a conclusion about algorithmic complexity. I do not think your data supports your conclusion. Also, the processing of the query is a bit more complicated than you suggest, and you have ignored the size of the larger table, which can have a significant effect. And I am curious about what is happening between 2,000 and 3,000 rows.