Optimize batch size based on elapsed time between consecutive calls

Time: 2013-07-01 11:11:28

Tags: c# algorithm batch-processing mathematical-optimization

I started out trying to create the following:

public static IEnumerable<List<T>> OptimizedBatches<T>(this IEnumerable<T> items)

A client of this extension method would then use it like this:

foreach (var list in extracter.EnumerateAll().OptimizedBatches()) 
{
   // at some unknown batch size, process time starts to 
   // increase at an exponential rate
}

Here is an example:

batch length         time
    1                 100ms
    2                 102ms
    4                 110ms
    8                 111ms
    16                118ms
    32                119ms
    64                134ms
    128               500ms <-- doubled length but time it took more than doubled
    256               1100ms <-- oh no!!

As you can see above, the best batch length is 64, because 64/134 is the best length/time ratio.
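The choice of 64 can be checked mechanically. The following is a minimal sketch, not part of the original question: the class name `ThroughputTable` and its members are hypothetical, and the samples are simply the rows of the table above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: names are illustrative, not from the original post.
public static class ThroughputTable
{
    // (batch length, elapsed milliseconds) pairs copied from the table above
    public static readonly (int Length, double Ms)[] Samples =
    {
        (1, 100), (2, 102), (4, 110), (8, 111), (16, 118),
        (32, 119), (64, 134), (128, 500), (256, 1100)
    };

    // Items processed per millisecond for one measurement
    public static double Rate(int length, double ms) => length / ms;

    // Returns the measured length with the highest length/time ratio
    public static int BestLength(IEnumerable<(int Length, double Ms)> samples) =>
        samples.OrderByDescending(s => Rate(s.Length, s.Ms)).First().Length;
}
```

Calling `ThroughputTable.BestLength(ThroughputTable.Samples)` on this data picks 64. This only ranks lengths that have already been measured; it does not decide which length to try next, which is the open part of the question.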

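So the question is: what algorithm should be used to automatically choose the best batch length based on the elapsed time between consecutive iterator steps?

This is what I have so far. It is not finished yet...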

class LengthOptimizer
{
    private Stopwatch sw;
    private int length = 1;
    private List<RateRecord> rateRecords = new List<RateRecord>();

    public int Length
    {
        get
        {
            if (sw == null)
            {
                length = 1;
                sw = new Stopwatch();
            }
            else
            {
                sw.Stop();
                rateRecords.Add(new RateRecord { Length = length, ElapsedMilliseconds = sw.ElapsedMilliseconds });
                length = rateRecords.OrderByDescending(c => c.Rate).First().Length;
            }
            // Restart() resets the elapsed time; Start() after Stop() would keep accumulating it
            sw.Restart();
            return length;
        }
    }
}

struct RateRecord
{
    public int Length { get; set; }
    public long ElapsedMilliseconds { get; set; }
    public float Rate { get { return ((float)Length) / ElapsedMilliseconds; } }
}

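3 answers:

Answer 0: (score: 1)

The main problem I see here is creating the "optimization scale", that is, deciding why you consider 32 -> 119ms acceptable and 256 -> 1100ms not, or why some configuration is better than another.

Once that is done, the algorithm is straightforward: return a ranking value for each input condition and decide based on "which one gets the higher value".

The first step in creating this scale is to find the variable that best describes the ideal behavior you are looking for. A simple first approach: length/time. That is, from your inputs: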

batch length           time             ratio1
    1                 100ms              0.01
    2                 102ms              0.019  
    4                 110ms              0.036  
    8                 111ms              0.072
    16                118ms              0.136
    32                119ms              0.269  
    64                134ms              0.478
    128               500ms              0.256
    256               1100ms             0.233

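The bigger ratio1, the better. Logically, a ratio of 0.269 at a length of 32 is not the same thing as 0.256 at a length of 128, so more information has to be taken into account.

You could create a more complex ranking ratio that weights the two variables involved better (for example, by trying different exponents). But I think the best approach to this problem is to create "zones" and compute a general ranking from them. For example: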

Zone 1 -> length from 1 to 8; ideal ratio for this zone = 0.1
Zone 2 -> length from 9 to 32; ideal ratio for this zone = 0.3
Zone 3 -> length from 33 to 64; ideal ratio for this zone = 0.45
Zone 4 -> length from 65 to 256; ideal ratio for this zone = 0.35

The ranking associated with each configuration is its ratio1 divided by the ideal value of the zone it falls in.

2      102ms        0.019 -> (zone 1) 0.019/0.1 = 0.19 (or 1.9 in a 0-10 scale)
16     118ms        0.136 -> (zone 2) 0.136/0.3 = 0.45 (or 4.5 in a 0-10 scale)  
etc.

These values can be compared directly, so you automatically know that the second case is much better than the first.
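The zone ranking can be sketched in a few lines. This is only an illustration under the assumptions of this answer: the zone boundaries and ideal ratios are the example values listed above, and the class name `ZoneRanking` and its members are hypothetical. The code uses the unrounded ratio, so 2 -> 102ms scores about 0.196 where the table above shows 0.19.

```csharp
using System;

// Hypothetical helper: names are illustrative, not from the original answer.
public static class ZoneRanking
{
    // Zones from the example above: inclusive length range and the ideal ratio1 for that range
    private static readonly (int Min, int Max, double Ideal)[] Zones =
    {
        (1, 8, 0.1), (9, 32, 0.3), (33, 64, 0.45), (65, 256, 0.35)
    };

    // ratio1 = length / time (items per millisecond)
    public static double Ratio1(int length, double ms) => length / ms;

    // Score = ratio1 divided by the ideal ratio of the zone the length falls in
    public static double Score(int length, double ms)
    {
        foreach (var z in Zones)
        {
            if (length >= z.Min && length <= z.Max)
                return Ratio1(length, ms) / z.Ideal;
        }
        throw new ArgumentOutOfRangeException(nameof(length), "No zone covers this length");
    }
}
```

With these example zones, `ZoneRanking.Score(16, 118)` comes out higher than `ZoneRanking.Score(2, 102)`, matching the comparison above. The quality of the result depends entirely on how well the ideal ratios are chosen.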

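This is just a simple example, but I think it gives good insight into the real problem here: setting up an accurate ranking that reliably identifies which configuration is better.

Answer 1: (score: 1)

I would go with a ranking approach like the one varocarbas suggested.

Here is an initial implementation to get you started: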

using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public sealed class DataFlowOptimizer<T>
{
    private readonly IEnumerable<T> _collection;
    private RateRecord bestRate = RateRecord.Default;
    private uint batchLength = 1;

    private struct RateRecord
    {
        public static RateRecord Default = new RateRecord { Length = 1, ElapsedTicks = 0 };
        private float _rate;

        public int Length { get; set; }
        public long ElapsedTicks { get; set; }
        public float Rate
        {
            get
            {
                if(_rate == default(float) && ElapsedTicks > 0)
                {
                    _rate = ((float)Length) / ElapsedTicks;
                }

                return _rate;
            }
        }
    }

    public DataFlowOptimizer(IEnumerable<T> collection)
    {
        _collection = collection;
    }

    public int BatchLength { get { return (int)batchLength; } }
    public float Rate { get { return bestRate.Rate; } }

    public IEnumerable<IList<T>> GetBatch()
    {
        var stopwatch = new Stopwatch();

        var batch = new List<T>();
        // Capacity matches the 10-sample window checked below
        var benchmarks = new List<RateRecord>(10);
        IEnumerator<T> enumerator = null;

        try
        {
            enumerator = _collection.GetEnumerator();

            uint count = 0;
            stopwatch.Start();

            while(enumerator.MoveNext())
            {   
                if(count == batchLength)
                {
                    benchmarks.Add(new RateRecord { Length = BatchLength, ElapsedTicks = stopwatch.ElapsedTicks });

                    var currentBatch = batch.ToList();
                    batch.Clear();

                    if(benchmarks.Count == 10)
                    {
                        var currentRate = benchmarks.Average(x => x.Rate);
                        if(currentRate > bestRate.Rate)
                        {
                            bestRate = new RateRecord { Length = BatchLength, ElapsedTicks = (long)benchmarks.Average(x => x.ElapsedTicks) };
                            batchLength = NextPowerOf2(batchLength);
                        }
                        // Set margin of error at 10%
                        else if((bestRate.Rate * .9) > currentRate)
                        {
                            // Shift the current length and make sure it's >= 1
                            var currentPowOf2 = ((batchLength >> 1) | 1);
                            batchLength = PreviousPowerOf2(currentPowOf2);
                        }

                        benchmarks.Clear();
                    }
                    count = 0;
                    stopwatch.Restart();

                    yield return currentBatch;
                }

                batch.Add(enumerator.Current);
                count++;
            }
        }
        finally
        {
            if(enumerator != null)
                enumerator.Dispose();
        }

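        stopwatch.Stop();

        // Yield whatever is left over; otherwise the final partial batch is silently dropped
        if(batch.Count > 0)
            yield return batch;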
    }

    uint PreviousPowerOf2(uint x)
    {
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);

        return x - (x >> 1);
    }

    uint NextPowerOf2(uint x)
    {
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);

        return (x+1);
    }
}

Sample program in LINQPad:

public IEnumerable<int> GetData()
{
    return Enumerable.Range(0, 100000000);
}

void Main()
{
    var optimizer = new DataFlowOptimizer<int>(GetData());

    foreach(var batch in optimizer.GetBatch())
    {
        string.Format("Length: {0} Rate {1}", optimizer.BatchLength, optimizer.Rate).Dump();
    }
}

Answer 2: (score: 0)

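  1. Describe an objective function f that maps a batch size s and its runtime t(s) to a score f(s, t(s))
  2. Try a large number of s values and evaluate f(s, t(s)) for each one
  3. Choose the s that maximizes f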
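The three steps can be sketched as a small search routine. This is a minimal sketch rather than a definitive implementation: the class `BatchSizeSearch`, its method names, and the delegate parameters `measure` and `objective` are all assumptions introduced for illustration. The caller supplies both how t(s) is measured and what f is.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: names are illustrative, not from the original answer.
public static class BatchSizeSearch
{
    // Steps 2 and 3: evaluate f(s, t(s)) for every candidate and keep the best s.
    // 'measure' returns the runtime t(s); 'objective' is the scoring function f.
    public static int Best(
        IEnumerable<int> candidateSizes,
        Func<int, TimeSpan> measure,
        Func<int, TimeSpan, double> objective)
    {
        bool found = false;
        int bestSize = 0;
        double bestScore = 0;
        foreach (int s in candidateSizes)
        {
            double score = objective(s, measure(s));
            if (!found || score > bestScore)
            {
                found = true;
                bestScore = score;
                bestSize = s;
            }
        }
        if (!found) throw new ArgumentException("No candidate sizes supplied");
        return bestSize;
    }

    // Convenience: time a caller-supplied action that processes one batch of the given size
    public static TimeSpan Time(Action<int> processBatch, int size)
    {
        var sw = Stopwatch.StartNew();
        processBatch(size);
        sw.Stop();
        return sw.Elapsed;
    }
}
```

With f(s, t) = s / t and the timings from the question, `Best` returns 64. In practice a single timing per size is noisy, so `measure` would likely average several runs, as the previous answer does with its 10-sample window.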