Efficiently filtering on a property in a large data set

Time: 2017-06-20 15:49:39

Tags: c#

I have an in-memory list containing a large number of objects (say 150,000). Each object has a string property that I want to search/filter on, like this:

var searchTerm = "something";
var result = listOfObjects.Where(o => o.Prop.Contains(searchTerm)).ToList();

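This is obviously slow. Is there a way to speed it up? I have already tried parallelizing it, with no benefit. Is there an approach that involves a HashSet? Or maybe sorting the list and doing a binary search?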

2 answers:

Answer 0: (score: 0)

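Here is what I tried. Granted, my data is a bit different, since I just generate random strings to test the filtering. But here is my sample code.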

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        Start:

        List<Test> TestList = new List<Test>();
        int ObjectsToCreate = 1000000;
        Console.WriteLine($"Creating {ObjectsToCreate} Objects!");
        for (int x = 1; x <= ObjectsToCreate; x++)
        {
            TestList.Add(new Test() { Name = RandomString(100) });
        }
        Console.WriteLine($"Created {TestList.Count} objects.");
        string StringToSearchFor = "A";
        Console.WriteLine($"Benchmarking Now");
        System.Diagnostics.Stopwatch Watch = System.Diagnostics.Stopwatch.StartNew();

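        // Note: Where is deferred, so this case and the next one only build the query;
        // the predicate does not run until the result is enumerated (e.g. by ToList).
        var TestCollection = TestList.Where(Item => Item.Name.Contains(StringToSearchFor));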
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With Where Into VAR: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With Where Into VAR: {Watch.ElapsedTicks} ticks");

        Watch = System.Diagnostics.Stopwatch.StartNew();
        IEnumerable<Test> TestCollection_ = TestList.Where(Item => Item.Name.Contains(StringToSearchFor));
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With Where Into IEnumerable<Test>: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With Where Into IEnumerable<Test>: {Watch.ElapsedTicks} ticks");

        Watch = System.Diagnostics.Stopwatch.StartNew();
        List<Test> TestCollection2 = TestList.Where(Item => Item.Name.Contains(StringToSearchFor)).ToList();
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With Where Into List<Test>: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With Where Into List<Test>: {Watch.ElapsedTicks} ticks");

        Watch = System.Diagnostics.Stopwatch.StartNew();
        List<Test> TestCollection3 = TestList.AsParallel().Where(Item => Item.Name.Contains(StringToSearchFor)).ToList();
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With AsParallel First Where Into List<Test>: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With AsParallel First Where Into List<Test>: {Watch.ElapsedTicks} ticks");

        Watch = System.Diagnostics.Stopwatch.StartNew();
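        // Note: AsParallel comes after Where here, so the predicate is evaluated while the
        // sequential source is enumerated and the filtering is not meaningfully parallelized.
        List<Test> TestCollection4 = TestList.Where(Item => Item.Name.Contains(StringToSearchFor)).AsParallel().ToList();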
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With AsParallel Last Where Into List<Test>: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With AsParallel Last Where Into List<Test>: {Watch.ElapsedTicks} ticks");

        Console.ReadLine();
        goto Start;
    }

    public class Test
    {
        public string Name { get; set; }
    }

    private static Random random = new Random();
    public static string RandomString(int length)
    {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        return new string(Enumerable.Repeat(chars, length)
          .Select(s => s[random.Next(s.Length)]).ToArray());
    }
}

Here is the output I get from running it.

Creating 1000000 Objects!
Created 1000000 objects.
Benchmarking Now
Elapsed Time With Where Into VAR: 0ms
Elapsed Time With Where Into VAR: 192 ticks
Elapsed Time With Where Into IEnumerable<Test>: 0ms
Elapsed Time With Where Into IEnumerable<Test>: 4 ticks
Elapsed Time With Where Into List<Test>: 273ms
Elapsed Time With Where Into List<Test>: 934287 ticks
Elapsed Time With AsParallel First Where Into List<Test>: 164ms
Elapsed Time With AsParallel First Where Into List<Test>: 564069 ticks
Elapsed Time With AsParallel Last Where Into List<Test>: 192ms
Elapsed Time With AsParallel Last Where Into List<Test>: 658852 ticks

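If I run the same test several times, the result for the var case drops to about 7-8 ticks on my machine, while the IEnumerable case drops to about 2-3. That is with 1 million items. So I am a bit confused about what you define as "very slow". Unless I am completely misunderstanding something.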

Edit: my var and IEnumerable examples are not as valid as I originally thought; see the comments on my answer below.
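
The near-zero timings for those two cases are consistent with LINQ's deferred execution: Where only builds a query object, and the predicate runs when the sequence is enumerated. A minimal sketch of that effect follows. The class name DeferredExecutionDemo and the counter calls are hypothetical and are not part of the benchmark above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Returns how many times the predicate had run before and after enumeration.
    public static (int beforeEnumeration, int afterEnumeration) CountPredicateCalls()
    {
        var items = new List<string> { "ABC", "XYZ", "A1" };
        int calls = 0;

        // Where only builds the query; the lambda has not been invoked yet.
        IEnumerable<string> query = items.Where(s => { calls++; return s.Contains("A"); });
        int before = calls;

        // ToList enumerates the query, so the predicate runs once per item here.
        List<string> matches = query.ToList();
        return (before, calls);
    }

    public static void Main()
    {
        var (before, after) = CountPredicateCalls();
        Console.WriteLine($"Predicate calls before enumeration: {before}, after: {after}");
    }
}
```

Timing only the Where call therefore measures query construction, not filtering. To time the filter itself, the Stopwatch has to span the enumeration (ToList, Count, or a foreach).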

Answer 1: (score: 0)

I can think of a few things.

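  • Start with the predicate you are testing. Contains is considerably more expensive than StartsWith or EndsWith, so use the cheapest predicate that meets your needs.
  • Parallelize the filtering process. Hint: the Parallel class.
  • Apply a (hash) function to all objects and insert them into a Dictionary (to get O(1) access time). If a "key" occurs more than once, add some other property (a salt) to the hash.
  • Consider using a database.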
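
The Dictionary suggestion can be sketched roughly as below. Two caveats: a hash index only answers exact-match lookups, so it does not speed up a substring query such as Contains; and instead of salting duplicate keys, this sketch groups items that share a value into one bucket, which is a different technique from the one described in the list. The names Item, Prop (borrowed from the question), ExactMatchIndex, Build and Find are hypothetical, and Prop is assumed to be non-null.

```csharp
using System;
using System.Collections.Generic;

public class Item
{
    public string Prop { get; set; }
}

public static class ExactMatchIndex
{
    // Builds the index once in O(n). Items sharing a Prop value land in the same bucket.
    public static Dictionary<string, List<Item>> Build(IEnumerable<Item> items)
    {
        var index = new Dictionary<string, List<Item>>(StringComparer.Ordinal);
        foreach (var item in items)
        {
            if (!index.TryGetValue(item.Prop, out var bucket))
            {
                bucket = new List<Item>();
                index[item.Prop] = bucket;
            }
            bucket.Add(item);
        }
        return index;
    }

    // Average O(1) lookup; returns an empty list when no item has that exact value.
    public static IReadOnlyList<Item> Find(Dictionary<string, List<Item>> index, string value)
    {
        return index.TryGetValue(value, out var bucket) ? bucket : new List<Item>();
    }

    public static void Main()
    {
        var index = Build(new[]
        {
            new Item { Prop = "alpha" },
            new Item { Prop = "beta" },
            new Item { Prop = "alpha" }
        });
        Console.WriteLine(Find(index, "alpha").Count);
    }
}
```

Building the index costs one pass over the list, so it only pays off when the same list is queried many times.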

It also depends heavily on your object structure. Does the object provide any other information we could use for comparison? If so:

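  • Split the data set into chunks (depending on property values) and start searching, in parallel, in the group with the highest probability of a match. (Heuristic function)
  • Use more "complex" structures (such as trees) to minimize the amount of data that has to be examined.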
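
One possible reading of the "complex structures" point is any ordered structure that lets a query skip most of the data. The sketch below is an assumption on my part, not the answerer's method: it uses a sorted array with binary search rather than a tree, and it only serves prefix (StartsWith) queries, not Contains. SortedPrefixIndex, Build and FindByPrefix are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SortedPrefixIndex
{
    // Sorts a copy of the values once, using ordinal comparison: O(n log n).
    public static string[] Build(IEnumerable<string> values)
    {
        var sorted = new List<string>(values).ToArray();
        Array.Sort(sorted, StringComparer.Ordinal);
        return sorted;
    }

    // Finds all entries that start with prefix: O(log n) to locate the first
    // candidate, then a linear walk over the matches only.
    public static List<string> FindByPrefix(string[] sorted, string prefix)
    {
        // Lower bound: first index whose value is >= prefix in ordinal order.
        int lo = 0, hi = sorted.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sorted[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }

        var result = new List<string>();
        // In ordinal order, all strings sharing the prefix are contiguous starting at lo.
        for (int i = lo; i < sorted.Length && sorted[i].StartsWith(prefix, StringComparison.Ordinal); i++)
        {
            result.Add(sorted[i]);
        }
        return result;
    }

    public static void Main()
    {
        var sorted = Build(new[] { "banana", "apple", "apricot", "cherry" });
        Console.WriteLine(string.Join(", ", FindByPrefix(sorted, "ap")));
    }
}
```

The sort and the search must use the same ordering (ordinal here); mixing culture-sensitive and ordinal comparison would break the contiguity the loop relies on.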

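Depending on how your program behaves, avoid loading the entire data set. Load only a limited number of records (e.g. 10,000), chosen by a heuristic value, and if the value you need is not present, use a replacement strategy to fetch new data.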
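
The answer does not say which replacement strategy to use. As one illustration, the sketch below assumes least-recently-used (LRU) eviction and a hypothetical loader delegate that stands in for fetching a missing record from the backing store. LruCache and its members are made-up names.

```csharp
using System;
using System.Collections.Generic;

public class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    // Hypothetical: fetches a missing entry from the backing store.
    private readonly Func<TKey, TValue> _loader;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map
        = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    // Most recently used entry is kept at the front of the list.
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order
        = new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity, Func<TKey, TValue> loader)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
        _loader = loader ?? throw new ArgumentNullException(nameof(loader));
    }

    public int Count => _map.Count;

    public bool Contains(TKey key) => _map.ContainsKey(key);

    public TValue Get(TKey key)
    {
        if (_map.TryGetValue(key, out var node))
        {
            // Cache hit: mark the entry as most recently used.
            _order.Remove(node);
            _order.AddFirst(node);
            return node.Value.Value;
        }

        // Cache miss: load first, then evict the least recently used entry if full.
        TValue value = _loader(key);
        if (_map.Count >= _capacity)
        {
            var oldest = _order.Last;
            _order.RemoveLast();
            _map.Remove(oldest.Value.Key);
        }
        _map[key] = _order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
        return value;
    }
}

public static class LruCacheDemo
{
    public static void Main()
    {
        var cache = new LruCache<int, string>(2, id => "record " + id);
        cache.Get(1);
        cache.Get(2);
        cache.Get(1); // touch 1 so that 2 becomes the eviction candidate
        cache.Get(3); // evicts 2
        Console.WriteLine(cache.Contains(2));
    }
}
```

This sketch is not thread-safe; a cache shared across parallel searches would need locking or a concurrent design.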