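Most efficient way to find all modes in a list of randomly generated integers, and how often they occur

Time: 2018-08-10 13:37:15

Tags: c# linq statistics

If a method written in C# is passed either null or between 0 and 6,000,000 randomly generated, unsorted integers, what is the most efficient way to determine all of the modes and how many times they occurred? In particular, can anyone help me with the LINQ-based solution I have been struggling with?

Here is what I have so far:

The closest I have come with LINQ only grabs the first mode it finds and does not report the number of occurrences. On my computer it is also about 7 times slower than my ugly, bulky hand-written implementation.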

    int mode = numbers.GroupBy(number => number).OrderByDescending(group => group.Count()).Select(k => k.Key).FirstOrDefault();

My hand-coded method:

    public class NumberCount
    {
        public int Value;
        public int Occurrences;

        public NumberCount(int value, int occurrences)
        {
            Value = value;
            Occurrences = occurrences;
        }
    }

    private static List<NumberCount> findMostCommon(List<int> integers)
    {
        if (integers == null)
            return null;
        else if (integers.Count < 1)
            return new List<NumberCount>();

        List<NumberCount> mostCommon = new List<NumberCount>();

        integers.Sort();

        mostCommon.Add(new NumberCount(integers[0], 1));
        for (int i=1; i<integers.Count; i++)
        {
            if (mostCommon[mostCommon.Count - 1].Value != integers[i])
                mostCommon.Add(new NumberCount(integers[i], 1));
            else
                mostCommon[mostCommon.Count - 1].Occurrences++;
        }

        List<NumberCount> answer = new List<NumberCount>();
        answer.Add(mostCommon[0]);
        for (int i=1; i<mostCommon.Count; i++) 
        {
            if (mostCommon[i].Occurrences > answer[0].Occurrences)
            {
                if (answer.Count == 1)
                {
                    answer[0] = mostCommon[i];
                }
                else
                {
                    answer = new List<NumberCount>();
                    answer.Add(mostCommon[i]);
                }
            }
            else if (mostCommon[i].Occurrences == answer[0].Occurrences)
            {
                answer.Add(mostCommon[i]);
            }
        }

        return answer;        
    }

Basically, I am trying to get an elegant, compact LINQ solution that is at least as fast as my ugly method. Thanks in advance for any suggestions.

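5 Answers:

Answer 0: (score: 0)

Personally, I would use a ConcurrentDictionary to update the counters, since dictionary lookups are fast. I use this approach quite often, and I find it more readable.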

  // requires: using System.Collections.Concurrent; using System.Linq;
  // create a dictionary mapping each number to its count
  var dictionary = new ConcurrentDictionary<int, int>();

  // list of your integers
  var numbers = new List<int>();

  // run the iteration in parallel (we can, because ConcurrentDictionary is thread-safe)
  numbers.AsParallel().ForAll((number) =>
  {
      // add the key with a value of 1 if it is missing; otherwise use the lambda to increment it by 1
      dictionary.AddOrUpdate(number, 1, (key, old) => old + 1);
  });

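After that, there are many ways to get the most frequent value. I don't fully understand your version, but finding the maximum is just a matter of one Aggregate call, like this: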

var topMostOccurence = dictionary.Aggregate((x, y) => { return x.Value > y.Value ? x : y; });
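
Note that the Aggregate call above yields a single entry, so ties are lost. A minimal sketch of how every mode could be pulled out of the count dictionary follows; the class and method names (`ModeFromCounts`, `AllModes`) are made up for illustration and are not part of the answer above.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ModeFromCounts
{
    // Hypothetical helper: returns every (value, count) pair whose count
    // equals the highest count in the dictionary.
    public static List<KeyValuePair<int, int>> AllModes(IDictionary<int, int> counts)
    {
        if (counts == null || counts.Count == 0)
            return new List<KeyValuePair<int, int>>();

        int max = counts.Values.Max();               // first pass: highest count
        return counts.Where(kv => kv.Value == max)   // second pass: keep every tie
                     .OrderBy(kv => kv.Key)          // predictable output order
                     .ToList();
    }
}
```

A ConcurrentDictionary<int, int> can be passed in directly, since it implements IDictionary<int, int>. The cost is two passes over the distinct values, which is usually small next to the counting step.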

Answer 1: (score: 0)

I tested with the following code on an Intel i7-8700K and got these results:

Lambda: found 78 in 134 ms.

Manual: found 78 in 368 ms.

Dictionary: found 78 in 195 ms.

    static IEnumerable<int> GenerateNumbers(int amount)
    {
        Random r = new Random();
        for (int i = 0; i < amount; i++)
            yield return r.Next(100);
    }

    static void Main(string[] args)
    {
        var numbers = GenerateNumbers(6_000_000).ToList();

        Stopwatch sw = Stopwatch.StartNew();
        int mode = numbers.GroupBy(number => number).OrderByDescending(group => group.Count()).Select(k =>
        {
            int count = k.Count();
            return new { Mode = k.Key, Count = count };
        }).FirstOrDefault().Mode;
        sw.Stop();
        Console.WriteLine($"Lambda: found {mode} in {sw.ElapsedMilliseconds} ms.");


        sw = Stopwatch.StartNew();
        mode = findMostCommon(numbers)[0].Value;
        sw.Stop();
        Console.WriteLine($"Manual: found {mode} in {sw.ElapsedMilliseconds} ms.");

        // create a dictionary
        var dictionary = new ConcurrentDictionary<int, int>();

        sw = Stopwatch.StartNew();
        // run the iteration in parallel (we can, because ConcurrentDictionary is thread-safe)
        numbers.AsParallel().ForAll((number) =>
        {
            // add the key with a value of 1 if it is missing; otherwise use the lambda to increment it by 1
            dictionary.AddOrUpdate(number, 1, (key, old) => old + 1);
        });
        mode = dictionary.Aggregate((x, y) => { return x.Value > y.Value ? x : y; }).Key;
        sw.Stop();
        Console.WriteLine($"Dictionary: found {mode} in {sw.ElapsedMilliseconds} ms.");


        Console.ReadLine();
    }

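Answer 2: (score: 0)

What you are after is the case where two or more numbers tie for the highest count in the array, for example: {1,1,1,2,2,2,3,3,3}

Your current code comes from here: Find the most occurring number in a List<int>. However, it returns only one number, which is the wrong result for your case.

The problem with LINQ is that you cannot make the loop stop early when you don't want it to continue.

Still, here is a LINQ query that produces the list you need: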

List<NumberCount> MaxOccurrences(List<int> integers)
{
    return integers?.AsParallel()
        .GroupBy(x => x)//group numbers, key is number, count is count
        .Select(k => new NumberCount(k.Key, k.Count()))
        .GroupBy(x => x.Occurrences)//group by Occurrences, key is Occurrences, value is result
        .OrderByDescending(x => x.Key) //sort
        .FirstOrDefault()? //the first one is result
        .ToList();
}
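
As a rough illustration of the same double-GroupBy idea on the tie example above, here is a self-contained sketch. It is sequential rather than parallel, returns tuples instead of NumberCount, and returns an empty list rather than null; those are my simplifications, and the names `LinqModes` and `AllModes` are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LinqModes
{
    // Returns all modes as (Value, Occurrences) tuples; empty list for null or empty input.
    public static List<(int Value, int Occurrences)> AllModes(IEnumerable<int> integers)
    {
        if (integers == null)
            return new List<(int Value, int Occurrences)>();

        return integers
            .GroupBy(x => x)                                      // one group per distinct number
            .Select(g => (Value: g.Key, Occurrences: g.Count()))  // number plus its count
            .GroupBy(p => p.Occurrences)                          // bucket numbers by their count
            .OrderByDescending(g => g.Key)                        // highest count first
            .Select(g => g.OrderBy(p => p.Value).ToList())        // materialize a bucket, sorted by value
            .FirstOrDefault()                                     // top bucket, or null if there is none
            ?? new List<(int Value, int Occurrences)>();
    }
}
```

Calling `LinqModes.AllModes(new[] { 1, 1, 1, 2, 2, 2, 3, 3, 3 })` gives three entries, one each for 1, 2 and 3, all with Occurrences equal to 3.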

Test details (elapsed times in milliseconds):

Array size: 30000

30000
MaxOccurrences only
MaxOccurrences1: 207
MaxOccurrences2: 38 
=============
Full List
Original1: 28
Original2: 23
ConcurrentDictionary1: 32
ConcurrentDictionary2: 34
AsParallel1: 27
AsParallel2: 19
AsParallel3: 36

Array size: 3000000

3000000
MaxOccurrences only
MaxOccurrences1: 3009
MaxOccurrences2: 1962 //<==this is the best one in big loop.
=============
Full List
Original1: 3200
Original2: 3234
ConcurrentDictionary1: 3391
ConcurrentDictionary2: 2681
AsParallel1: 3776
AsParallel2: 2389
AsParallel3: 2155

Here is the code:

class Program
{
    static void Main(string[] args)
    {
        const int listSize = 3000000;
        var rnd = new Random();
        var randomList = Enumerable.Range(1, listSize).OrderBy(e => rnd.Next()).ToList();

        // the code that you want to measure comes here

        Console.WriteLine(randomList.Count);
        Console.WriteLine("MaxOccurrences only");

        Test(randomList, MaxOccurrences1);
        Test(randomList, MaxOccurrences2);


        Console.WriteLine("=============");
        Console.WriteLine("Full List");
        Test(randomList, Original1);
        Test(randomList, Original2);
        Test(randomList, ConcurrentDictionary1);
        Test(randomList, ConcurrentDictionary2);
        Test(randomList, AsParallel1);
        Test(randomList, AsParallel2);
        Test(randomList, AsParallel3);

        Console.ReadLine();
    }

    private static void Test(List<int> data, Action<List<int>> method)
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        method(data);
        watch.Stop();
        Console.WriteLine($"{method.Method.Name}: {watch.ElapsedMilliseconds}");
    }
    private static void Original1(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .OrderByDescending(group => group.Count())
            .Select(k => new NumberCount(k.Key, k.Count()))
            .ToList();
    }

    private static void Original2(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .Select(k => new NumberCount(k.Key, k.Count()))
            .OrderByDescending(x => x.Occurrences)
            .ToList();
    }

    private static void AsParallel1(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .AsParallel() //groups are counted in parallel
            .Select(k => new NumberCount(k.Key, k.Count())) //Grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void AsParallel2(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .Select(k => new
            {
                Key = k.Key,
                Occurrences = k.Count()
            }) //Grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void AsParallel3(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .Select(k => new NumberCount(k.Key, k.Count())) //Grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }


    private static void MaxOccurrences1(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .GroupBy(x => x.Count())
            .OrderByDescending(x => x.Key)
            .FirstOrDefault()?
            .ToList()
            .Select(k => new NumberCount(k.Key, k.Count()))
            .ToList();
    }

    private static void MaxOccurrences2(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(x => x)//group numbers, key is number, count is count
            .Select(k => new NumberCount(k.Key, k.Count()))
            .GroupBy(x => x.Occurrences)//group by Occurrences, key is Occurrences, value is result
            .OrderByDescending(x => x.Key) //sort
            .FirstOrDefault()? //the first one is result
            .ToList();
    }
    private static void ConcurrentDictionary1(List<int> integers)
    {
        ConcurrentDictionary<int, int> result = new ConcurrentDictionary<int, int>();

        integers?.ForEach(x => { result.AddOrUpdate(x, 1, (key, old) => old + 1); });

        result.OrderByDescending(x => x.Value).ToList();
    }
    private static void ConcurrentDictionary2(List<int> integers)
    {
        ConcurrentDictionary<int, int> result = new ConcurrentDictionary<int, int>();

        integers?.AsParallel().ForAll(x => { result.AddOrUpdate(x, 1, (key, old) => old + 1); });

        result.OrderByDescending(x => x.Value).ToList();
    }

}
public class NumberCount
{
    public int Value;
    public int Occurrences;

    public NumberCount(int value, int occurrences)
    {
        Value = value;
        Occurrences = occurrences;
    }
}

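Answer 3: (score: 0)

Different code is more efficient for different lengths, but as the length approaches 6 million, this approach seems to be the fastest. In general, LINQ is not for making code faster but for making it easier to understand and maintain, depending on how you feel about a functional programming style.

Your code is fairly fast, and it beats the simple LINQ approaches that use GroupBy. It benefits a lot from the fact that List.Sort is highly optimized. My code uses that too, but on a local copy of the list to avoid changing the source. My code is similar in approach to yours, but it is designed around a single pass that does all of the required computation. It uses an extension method that I re-optimized for this problem, called GroupByRuns, which returns an IEnumerable<IGrouping<T,T>>. It is also hand-expanded rather than falling back on the generic GroupByRuns, which takes extra arguments for key and result selection. Since .Net has no end-user accessible IGrouping<,> implementation (!), I rolled my own, which implements ICollection to optimize Count().

This code runs about 1.3 times as fast as yours (after I optimized your code by 5%).

First, the RunGrouping class, used to return a group of runs: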

public class RunGrouping<T> : IGrouping<T, T>, ICollection<T> {
    public T Key { get; }
    int Count;

    int ICollection<T>.Count => Count;
    public bool IsReadOnly => true;

    public RunGrouping(T key, int count) {
        Key = key;
        Count = count;
    }

    public IEnumerator<T> GetEnumerator() {
        for (int j1 = 0; j1 < Count; ++j1)
            yield return Key;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public void Add(T item) => throw new NotImplementedException();
    public void Clear() => throw new NotImplementedException();
    public bool Contains(T item) => Count > 0 && EqualityComparer<T>.Default.Equals(Key, item);
    public void CopyTo(T[] array, int arrayIndex) => throw new NotImplementedException();
    public bool Remove(T item) => throw new NotImplementedException();
}

Second, the extension method on IEnumerable that groups the runs:

public static class IEnumerableExt {
    public static IEnumerable<IGrouping<T, T>> GroupByRuns<T>(this IEnumerable<T> src) {
        var cmp = EqualityComparer<T>.Default;
        bool notAtEnd = true;
        using (var e = src.GetEnumerator()) {
            bool moveNext() {
                notAtEnd = e.MoveNext();
                return notAtEnd;
            }
            IGrouping<T, T> NextRun() {
                var prev = e.Current;
                var ct = 0;
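                // Check notAtEnd first: once MoveNext() has returned false, Current is
                // undefined (default(T) for List<T>) and could keep matching prev forever.
                while (notAtEnd && cmp.Equals(e.Current, prev)) {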
                    ++ct;
                    moveNext();
                }
                return new RunGrouping<T>(prev, ct);
            }

            moveNext();
            while (notAtEnd)
                yield return NextRun();
        }
    }
}

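Finally, the extension method that finds the modes with the maximum count. Basically, it goes through all of the runs and keeps a record of the ints that currently have the longest run.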

public static class IEnumerableIntExt {
    public static IEnumerable<KeyValuePair<int, int>> MostCommon(this IEnumerable<int> src) {
        var mysrc = new List<int>(src);
        mysrc.Sort();
        var maxc = 0;
        var maxmodes = new List<int>();
        foreach (var g in mysrc.GroupByRuns()) {
            var gc = g.Count();

            if (gc > maxc) {
                maxmodes.Clear();
                maxmodes.Add(g.Key);
                maxc = gc;
            }
            else if (gc == maxc)
                maxmodes.Add(g.Key);
        }

        return maxmodes.Select(m => new KeyValuePair<int, int>(m, maxc));
    }
}

Given an existing random list of integers rl, you can get the answer with:

var ans = rl.MostCommon();
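
For readers who want the sort-then-scan idea without the helper types, it can be sketched roughly as follows. This is a simplified illustration and not the code measured above; the class name `SortedRunModes` is made up.

```csharp
using System.Collections.Generic;

public static class SortedRunModes
{
    // Sorts a local copy, then walks it once, measuring each run of equal values.
    public static List<KeyValuePair<int, int>> MostCommon(IEnumerable<int> src)
    {
        var result = new List<KeyValuePair<int, int>>();
        if (src == null)
            return result;

        var sorted = new List<int>(src);   // local copy, so the caller's list is untouched
        sorted.Sort();

        var modes = new List<int>();
        int maxCount = 0;
        int i = 0;
        while (i < sorted.Count)
        {
            int value = sorted[i];
            int start = i;
            while (i < sorted.Count && sorted[i] == value)
                i++;                       // advance to the end of this run
            int runLength = i - start;

            if (runLength > maxCount)
            {
                modes.Clear();             // a longer run replaces all earlier candidates
                modes.Add(value);
                maxCount = runLength;
            }
            else if (runLength == maxCount)
            {
                modes.Add(value);          // a tie with the current longest run
            }
        }

        foreach (int m in modes)
            result.Add(new KeyValuePair<int, int>(m, maxCount));
        return result;
    }
}
```

For an input such as { 5, 3, 5, 3, 1 } this returns the pairs (3, 2) and (5, 2), and the input itself stays in its original order.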

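Answer 4: (score: -1)

So far, Netmage's is the fastest I have found. The only thing I have come up with that can beat it (at least within a valid range of 1 to 500,000,000) only works on my computer for values ranging from 1 to 500,000,000 or less, because I have only 8 GB of RAM. That prevents me from testing it across the full range of 1 to int.MaxValue, and I suspect it would fall behind in speed at that size, since it seems to struggle more and more as the range grows. It uses the values as indexes and the values stored at those indexes as the occurrence counts. With 6 million randomly generated positive 16-bit integers, it is about 20 times as fast as my original method in Release mode. With 32-bit integers (range 1 to 500,000,000), it is only about 1.6 times as fast.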

    private static List<NumberCount> findMostCommon(List<int> integers)
    {
        List<NumberCount> answers = new List<NumberCount>();

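        // _Max is assumed to be a class-level constant: one more than the largest possible value.
        int[] mostCommon = new int[_Max];

        int max = 0;
        // Start at 0 so that the first element is counted as well.
        for (int i = 0; i < integers.Count; i++)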
        {
            int iValue = integers[i];
            mostCommon[iValue]++;
            int intVal = mostCommon[iValue];
            if (intVal > 1)
            {
                if (intVal > max)
                {
                    max++;
                    answers.Clear();
                    answers.Add(new NumberCount(iValue, max));
                }
                else if (intVal == max)
                {
                    answers.Add(new NumberCount(iValue, max));
                }
            }
        }

        if (answers.Count < 1)
            answers.Add(new NumberCount(0, -100)); // This -100 Occurrences value signifies that no value occurred more than once.

        return answers;
    }

Perhaps a branch like this would be optimal:

if (list.Count < sizeLimit) 
    answers = getFromSmallRangeMethod(list);
else 
    answers = getFromStandardMethod(list);
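
One way such a branch might look is sketched below. Note that it switches on the range of the values rather than on list.Count, which is my own substitution, and that `RangeLimit`, `ModeDispatch` and both helper methods are hypothetical names with an arbitrary cut-off; the fallback here is a plain Dictionary count rather than any of the methods benchmarked above.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ModeDispatch
{
    // Assumed cut-off: the largest value for which a counting array is still allocated.
    public const int RangeLimit = 10_000_000;

    public static List<KeyValuePair<int, int>> AllModes(List<int> list)
    {
        if (list == null || list.Count == 0)
            return new List<KeyValuePair<int, int>>();

        int min = list.Min(), max = list.Max();
        return (min >= 0 && max <= RangeLimit)
            ? FromCountingArray(list, max)   // small non-negative range: values used as indexes
            : FromDictionary(list);          // otherwise: general-purpose counting
    }

    private static List<KeyValuePair<int, int>> FromCountingArray(List<int> list, int max)
    {
        var counts = new int[max + 1];
        int best = 0;
        foreach (int v in list)
            if (++counts[v] > best)
                best = counts[v];            // track the highest count seen so far

        var modes = new List<KeyValuePair<int, int>>();
        for (int v = 0; v <= max; v++)
            if (counts[v] == best)
                modes.Add(new KeyValuePair<int, int>(v, best));
        return modes;
    }

    private static List<KeyValuePair<int, int>> FromDictionary(List<int> list)
    {
        var counts = new Dictionary<int, int>();
        int best = 0;
        foreach (int v in list)
        {
            counts.TryGetValue(v, out int c);   // c is 0 when the key is missing
            counts[v] = ++c;
            if (c > best)
                best = c;
        }
        return counts.Where(kv => kv.Value == best).OrderBy(kv => kv.Key).ToList();
    }
}
```

Both paths return the modes in ascending order with their shared count, so callers do not need to know which one ran. The right cut-off depends on available memory and would have to be measured.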