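What is the fastest way to implement SQL LIKE 'x%' on keys

Time: 2015-11-11 11:15:37

Tags: c# performance dictionary collections trie

I need to do a very fast prefix "SQL LIKE" search over hundreds of thousands of keys. I ran performance tests with SortedList, Dictionary, and SortedDictionary, like this: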

var dictionary = new Dictionary<string, object>();
// add a million random strings
var results = dictionary.Where(x=>x.Key.StartsWith(prefix));

I found that all of them take a long time; Dictionary is the fastest and SortedDictionary is the slowest.

Then I tried the Trie implementation from http://www.codeproject.com/Articles/640998/NET-Data-Structures-for-Prefix-String-Search-and-S, which is far faster, i.e. milliseconds instead of seconds.
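To illustrate why a trie answers prefix queries so quickly, here is a minimal sketch. It is not the CodeProject implementation: the class name `SimpleTrie` and its internals are hypothetical, and only the `Add`/`Retrieve` method names mirror the calls used in the test further down. A lookup walks one node per prefix character and then visits only the subtree below that node, so it never touches keys that do not match.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal trie, for illustration only (not the CodeProject code).
public sealed class SimpleTrie<TValue>
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly List<TValue> Values = new List<TValue>();
    }

    private readonly Node root = new Node();

    // Walks (creating as needed) one node per character, then stores the value at the last node.
    public void Add(string key, TValue value)
    {
        var node = root;
        foreach (var c in key)
        {
            Node child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new Node();
                node.Children.Add(c, child);
            }
            node = child;
        }
        node.Values.Add(value);
    }

    // Returns the values of every key that starts with the given prefix.
    public IEnumerable<TValue> Retrieve(string prefix)
    {
        var node = root;
        foreach (var c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
                yield break; // no key has this prefix
        }

        // Depth-first walk of the subtree below the prefix node.
        var stack = new Stack<Node>();
        stack.Push(node);
        while (stack.Count > 0)
        {
            var current = stack.Pop();
            foreach (var value in current.Values)
                yield return value;
            foreach (var child in current.Children.Values)
                stack.Push(child);
        }
    }
}
```

With this sketch, `trie.Retrieve("AB").Any()` costs roughly one dictionary lookup per prefix character, regardless of how many keys are stored.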

So my question is: is there a .NET collection I can use for the above requirement? I would have thought this was a common requirement.

My basic test:

    class Program
    {
        static readonly Dictionary<string, object> dictionary = new Dictionary<string, object>(); 
        static Trie<object> trie = new Trie<object>(); 

        static void Main(string[] args)
        {
            var random = new Random();
            for (var i = 0; i < 100000; i++)
            {
                var randomstring = RandomString(random, 7);
                dictionary.Add(randomstring, null);
                trie.Add(randomstring, null);
            }

            var lookups = new string[10000];
            for (var i = 0; i < lookups.Length; i++)
            {
                lookups[i] = RandomString(random, 3);
            }

            // compare searching
            var sw = new Stopwatch();
            sw.Start();
            foreach (var lookup in lookups)
            {
                var exists = dictionary.Any(k => k.Key.StartsWith(lookup));
            }
            sw.Stop();
            Console.WriteLine("dictionary.Any(k => k.Key.StartsWith(randomstring)) took : {0} ms", sw.ElapsedMilliseconds);

            // test other collections

            sw.Restart();
            foreach (var lookup in lookups)
            {
                var exists = trie.Retrieve(lookup).Any();
            }
            sw.Stop();
            Console.WriteLine("trie.Retrieve(lookup) took : {0} ms", sw.ElapsedMilliseconds);

            Console.ReadKey();
        }

        public static string RandomString(Random random,int length)
        {
            const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

            return new string(Enumerable.Repeat(chars, length)
              .Select(s => s[random.Next(s.Length)]).ToArray());
        }
    }

Results:

dictionary.Any(k => k.Key.StartsWith(randomstring)) took : 80990 ms
trie.Retrieve(lookup) took : 115 ms

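3 Answers:

Answer 0: (score: 0)

If ordering matters, try SortedList instead of SortedDictionary. Both offer the same functionality, but they are implemented differently. SortedList is faster when you want to enumerate the elements (and you can access elements by index), whereas SortedDictionary is faster when there are many elements and you want to insert a new element in the middle of the collection.

So try this: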

var sortedList = new SortedList<string, object>();
// populate list...

sortedList.Keys.Any(k => k.StartsWith(lookup));

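If you have a million elements but do not want to re-sort them once the dictionary is populated, you can combine the advantages of both: populate a SortedDictionary with the random elements, then create a new List<KeyValuePair<,>> or SortedList<,> from it.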
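A minimal sketch of that combination, assuming the keys are strings and that an ordinal `StartsWith` is acceptable (the answer's own snippet uses the culture-sensitive default). The names `SortedListDemo`, `Freeze`, and `AnyKeyStartsWith` are hypothetical. Note that the prefix check is still a linear scan over the keys; the SortedList only makes the enumeration cheaper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedListDemo
{
    // Copies an already populated SortedDictionary into a SortedList in one bulk step,
    // instead of inserting the elements into the list one at a time.
    public static SortedList<string, object> Freeze(SortedDictionary<string, object> source)
    {
        // Reuse the source comparer so both collections agree on the key order.
        return new SortedList<string, object>(source, source.Comparer);
    }

    // Linear scan over the sorted keys, as in the answer's snippet (ordinal comparison is an assumption).
    public static bool AnyKeyStartsWith(SortedList<string, object> list, string prefix)
    {
        return list.Keys.Any(k => k.StartsWith(prefix, StringComparison.Ordinal));
    }
}
```

Typical use would be to fill a `SortedDictionary<string, object>` during the load phase, call `SortedListDemo.Freeze(...)` once, and run all lookups against the returned list.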

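Answer 1: (score: 0)

If you can sort the keys once and then reuse them repeatedly for prefix lookups, you can use a binary search to speed things up.

For maximum performance, I would use two arrays, one for the keys and one for the values, and use the overload of Array.Sort() that sorts a main array and an accompanying array in tandem.

You can then use Array.BinarySearch() to find the nearest key that starts with the given prefix and return the indices of the matches.

When I tried it, each check seemed to take only about 0.003 ms when there were one or more matching prefixes.

Here is a runnable console application to demonstrate (remember to do your timings on a RELEASE build):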

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        public static void Main()
        {
            int count = 1000000;
            object obj = new object();

            var keys   = new string[count];
            var values = new object[count];

            for (int i = 0; i < count; ++i)
            {
                keys[i] = randomString(5, 16);
                values[i] = obj;
            }

            // Sort key array and value arrays in tandem to keep the relation between keys and values.

            Array.Sort(keys, values);

            // Now you can use StartsWith() to return the indices of strings in keys[]
            // that start with a specific string. The indices can be used to look up the
            // corresponding values in values[].

            Console.WriteLine("Count of ZZ = " + StartsWith(keys, "ZZ").Count());

            // Test a load of times with 1000 random prefixes.

            var prefixes = new string[1000];

            for (int i = 0; i < 1000; ++i)
                prefixes[i] = randomString(1, 8);

            var sw = Stopwatch.StartNew();

            for (int i = 0; i < 1000; ++i)
                for (int j = 0; j < 1000; ++j)
                    StartsWith(keys, prefixes[j]).Any();

            Console.WriteLine("1,000,000 checks took {0} for {1} ms each.", sw.Elapsed, sw.ElapsedMilliseconds/1000000.0);
        }

        public static IEnumerable<int> StartsWith(string[] array, string prefix)
        {
            int index = Array.BinarySearch(array, prefix);

            if (index < 0)
                index = ~index;

            // We might have landed partway through a set of matches, so find the first match.

            if (index < array.Length)
                while ((index > 0) && array[index-1].StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                    --index;

            while ((index < array.Length) && array[index].StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                yield return index++;
        }

        static string randomString(int minLength, int maxLength)
        {
            int length = rng.Next(minLength, maxLength);

            const string CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
            return new string(Enumerable.Repeat(CHARS, length)
              .Select(s => s[rng.Next(s.Length)]).ToArray());
        }

        static readonly Random rng = new Random(12345);
    }
}

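Answer 2: (score: 0)

After some testing, I found something that comes close by using a binary search; the only drawback is that you have to sort the keys from a to z first. The larger the list, the slower the search gets, so a ternary search is the fastest you can actually get on a binary PC architecture.

Method (credit should go to @Guffa):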

    public static int BinarySearchStartsWith(List<string> words, string prefix, int min, int max)
    {
        while (max >= min)
        {
            var mid = (min + max) / 2;
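            // Compare at most prefix.Length characters; unlike Substring, this does not throw
            // when the word is shorter than the prefix.
            var comp = string.CompareOrdinal(words[mid], 0, prefix, 0, prefix.Length);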
            if (comp >= 0)
            {
                if (comp > 0)
                    max = mid - 1;
                else
                    return mid;
            }
            else
                min = mid + 1;
        }
        return -1;
    }

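And the test implementation:

        // Sort ordinally so the list order matches the CompareOrdinal calls in BinarySearchStartsWith.
        var keysToList = dictionary.Keys.OrderBy(q => q, StringComparer.Ordinal).ToList();
        sw = new Stopwatch();
        sw.Start();
        foreach (var lookup in lookups)
        {
            bool exist = BinarySearchStartsWith(keysToList, lookup, 0, keysToList.Count - 1) != -1;
        }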
        sw.Stop();
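BinarySearchStartsWith returns the index of some matching key, which is enough for an existence check. If all matching keys are needed, one option is a lower-bound variant of the same binary search that finds the first match and then reads forward. The sketch below is an illustration under stated assumptions, not part of the original answer: the names `PrefixRange`, `LowerBound`, and `StartingWith` are hypothetical, and the list is assumed to be sorted with ordinal comparison.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixRange
{
    // Index of the first word whose leading prefix.Length characters are not
    // ordinally less than the prefix (words.Count if there is no such word).
    private static int LowerBound(List<string> words, string prefix)
    {
        int lo = 0, hi = words.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            // Compares at most prefix.Length characters, so shorter words are safe.
            if (string.CompareOrdinal(words[mid], 0, prefix, 0, prefix.Length) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    // Yields every word that starts with the prefix; assumes words is sorted ordinally.
    public static IEnumerable<string> StartingWith(List<string> words, string prefix)
    {
        for (int i = LowerBound(words, prefix);
             i < words.Count && words[i].StartsWith(prefix, StringComparison.Ordinal);
             i++)
        {
            yield return words[i];
        }
    }
}
```

Because matching keys are contiguous in a sorted list, the cost is one binary search plus one step per match.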
