Question

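I'm looking for an optimized algorithm that takes an array (or list) of a struct I wrote, removes the duplicate elements, and returns the result.
I know I can do this with a simple algorithm of complexity O(n^2), but I'd like a better one.

Any help would be appreciated.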

Answer 1

This runs in close to O(N) time:

var result = items.Distinct().ToList();

[Edit]

Since Microsoft doesn't provide any documented guarantee of O(N) time, I ran some timings with the following code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        private void run()
        {
            test(1000);
            test(10000);
            test(100000);
        }

        private void test(int n)
        {
            var items = Enumerable.Range(0, n);
            new Action(() => items.Distinct().Count())
                .TimeThis("Distinct() with n == " + n + ": ", 10000);
        }

        static void Main()
        {
            new Program().run();
        }
    }

    static class DemoUtil
    {
        public static void TimeThis(this Action action, string title, int count = 1)
        {
            var sw = Stopwatch.StartNew();

            for (int i = 0; i < count; ++i)
                action();

            Console.WriteLine("Calling {0} {1} times took {2}",  title, count, sw.Elapsed);
        }
    }
}

The results:

Calling Distinct() with n == 1000:   10000 times took 00:00:00.5008792
Calling Distinct() with n == 10000:  10000 times took 00:00:06.1388296
Calling Distinct() with n == 100000: 10000 times took 00:00:58.5542259

The time grows approximately linearly with n, at least for this particular test, which suggests that an O(N) algorithm is being used.

Answer 2

You can sort the array in O(N log N) time and then compare adjacent elements to remove the duplicates.
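A minimal sketch of the sort-and-compare approach, assuming the element type implements IComparable<T>. The names SortDedup and RemoveDuplicatesBySorting are made up for illustration. Note that the result comes back in sorted order, not in the original order.

```csharp
using System;
using System.Collections.Generic;

static class SortDedup
{
    // Hypothetical helper: returns the distinct elements of 'input' in sorted order.
    // Sorting costs O(N log N); the single pass afterwards is O(N).
    public static List<T> RemoveDuplicatesBySorting<T>(IEnumerable<T> input)
        where T : IComparable<T>
    {
        var sorted = new List<T>(input);   // copy, so the caller's data is not reordered
        sorted.Sort();                     // uses Comparer<T>.Default

        var comparer = Comparer<T>.Default;
        var result = new List<T>();
        foreach (T item in sorted)
        {
            // After sorting, equal elements are adjacent, so comparing against
            // the last element kept is enough to detect a duplicate.
            if (result.Count == 0 || comparer.Compare(item, result[result.Count - 1]) != 0)
                result.Add(item);
        }
        return result;
    }
}
```

For example, calling SortDedup.RemoveDuplicatesBySorting(new[] { 3, 1, 3, 2, 1 }) yields a list containing 1, 2, 3. This variant needs no hash function, which can be useful when the type has an ordering but no good hash.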

Answer 3

You can use a HashSet, which has O(N) complexity:

List<int> RemoveDuplicates(List<int> input)
{
    var result = new HashSet<int>(input);
    return result.ToList();
}

But it increases memory usage.

Answer 4

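Using LINQ's Distinct really is the simplest solution. It uses a hash-table-based approach, probably very similar to the algorithm below.

If you're interested in what such an algorithm looks like: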

IEnumerable<T> Distinct<T>(IEnumerable<T> sequence)
{
    var alreadySeen = new HashSet<T>();
    foreach (T item in sequence)
    {
        if (alreadySeen.Add(item)) // Add returns false if item was already in the set
            yield return item;
    }
}

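If there are d distinct elements among n elements, this algorithm takes O(d) memory and O(n) time.

Since this algorithm uses a hash set, it needs well-distributed hashes to achieve O(n) runtime. If the hash is poor, the runtime can degrade to O(n*d).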
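Since the question is about a custom struct, here is a rough sketch of giving such a type its own equality and hash code so that hash-based deduplication behaves well. The Point struct, its fields, and the PointDemo helper are hypothetical, and the multiplier 397 is just a commonly used prime, not a requirement.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical struct, used only for illustration.
struct Point : IEquatable<Point>
{
    public readonly int X;
    public readonly int Y;

    public Point(int x, int y) { X = x; Y = y; }

    // Strongly typed equality; picked up by EqualityComparer<Point>.Default,
    // which Distinct() and HashSet<Point> use when no comparer is supplied.
    public bool Equals(Point other) { return X == other.X && Y == other.Y; }

    public override bool Equals(object obj) { return obj is Point && Equals((Point)obj); }

    // Mixes both fields so that different points rarely share a hash code.
    public override int GetHashCode()
    {
        unchecked { return (X * 397) ^ Y; }
    }
}

static class PointDemo
{
    // Hypothetical helper: removes duplicate points using the equality defined above.
    public static List<Point> DistinctPoints(IEnumerable<Point> points)
    {
        return points.Distinct().ToList();
    }
}
```

With this in place, PointDemo.DistinctPoints(new[] { new Point(1, 2), new Point(1, 2), new Point(3, 4) }) returns two points. Without the overrides, the struct falls back to the inherited ValueType implementations, which may be slower and may distribute hashes less evenly, depending on the struct's layout.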
