Question

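I have about 1 billion records in my database, and I need to update all of them with 20 values assigned at random, so each value has to be written to a random set of 50 million records. My idea was to generate a list containing all 1 billion numbers, pick 50 million of them at random, remove those from the list, and so on.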

My code:

Creating the list:

List<long> LstMainList = new List<long>();
for (int i = 1; i <= 999999999; i++)
{
   LstMainList.Add(i);
}

New empty list: List<TableData> Table1 = new List<TableData>();

Picking random numbers, adding them to the new list, and removing them from the main list:

Random rand = new Random();
Thread previousThread = null; // worker running the previous bulk copy, if any

for (int a = 0; a < 50000000; a++)
{
     int lstindex = rand.Next(LstMainList.Count);

     Int64 lstData = LstMainList[lstindex];

     Table1.Add(new TableData { MESSAGE_ID = lstData });

     LstMainList.RemoveAt(lstindex);

     if (a % 100000 == 0)
     {
         if (previousThread != null)
         {
              previousThread.Join();
         }

          List<TableData> copyList = Table1.ToList();

          previousThread = new Thread(() => BulkCopyList(copyList, "PLAN_TABLE_1"));
          previousThread.Start();

          Table1.Clear();
       }
}

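Now, my problem is the line LstMainList.RemoveAt(lstindex);. It takes a very long time to remove an index from the main list, because the list holds close to 1 billion records.

Is there a simpler way to remove records from a List, or any other way to make this easier?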

Answer 1

First, use an array instead of a list (especially a list created without an initial capacity):

int idsCount = 100000000;
long[] ids = new long[idsCount];

for (int i = 0; i < idsCount; i++)
    ids[i] = i + 1; // IDs run from 1 to idsCount; every slot is filled

Then shuffle the IDs in the array with a Fisher–Yates shuffle:

Random rnd = new Random();
int n = idsCount;
while(n > 1)
{
    int k = rnd.Next(n);
    n--;
    long temp = ids[n];
    ids[n] = ids[k];
    ids[k] = temp;
}
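
The loop above can also be wrapped in a small reusable helper. The following is only a minimal sketch and not part of the original answer: the `ShuffleHelper` class and its `Shuffle` method are hypothetical names, and the `Random` instance is passed in so that a fixed seed can be used for repeatable runs.

```csharp
using System;

public static class ShuffleHelper
{
    // Hypothetical helper: in-place Fisher–Yates shuffle of any array.
    public static void Shuffle<T>(T[] items, Random rng)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        if (rng == null) throw new ArgumentNullException(nameof(rng));

        // Walk from the last slot down; swap each slot with a random slot at or before it.
        for (int n = items.Length - 1; n > 0; n--)
        {
            int k = rng.Next(n + 1); // 0 <= k <= n
            T temp = items[n];
            items[n] = items[k];
            items[k] = temp;
        }
    }
}
```

A call such as `ShuffleHelper.Shuffle(ids, new Random());` touches each element once, so the cost is O(n) with no extra memory, compared with the repeated element shifting that `RemoveAt` causes.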

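With shuffled IDs, you don't need to modify the ID list at all. Removing an item at a random position is a very expensive operation: if you remove the item at position 0, every remaining item in the list has to be moved down by one position. Now you can simply iterate over the ids array.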

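Or you can use MoreLINQ's Batch to split the IDs into batches, create the TableData objects for each batch, and process them batch by batch:

int size = 100000;
foreach (var batch in ids.Batch(size))
{
    // Batch yields one bucket of IDs at a time; project each ID to a TableData
    var copyList = batch.Select(id => new TableData { MESSAGE_ID = id }).ToList();
    // ...
}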

Update: since you need batches of different sizes, you can use the following extension method to get a range of items from the array:

public static IEnumerable<T> GetRange<T>(
     this T[] array, int startIndex, int count)
{
    for (int i = startIndex; i < startIndex + count; i++)
        yield return array[i];
}

So getting 5000 TableData items starting at index 20000 looks like this:

var copyList = ids.GetRange(20000, 5000)
                  .Select(id => new TableData { MESSAGE_ID = id })
                  .ToList();

Of course, a more efficient way is to iterate over the ids array and add the items to a list created with a preset capacity:

int size = 5000;
int startIndex = 20000;
List<TableData> copyList = new List<TableData>(size);
for (int i = startIndex; i < startIndex + size; i++)
    copyList.Add(new TableData { MESSAGE_ID = ids[i] });

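Going further, I would move the creation of the TableData objects to the thread that performs the bulk copy, and just pass it the sequence of IDs it should use.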
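
One way to hand only a range of IDs to the worker thread is `ArraySegment<long>`, which wraps the array, an offset and a count without copying anything. This is a rough sketch that assumes the `ids` array has already been shuffled; `BatchSlicer` and `Slice` are hypothetical names, and the last batch may be shorter than the others.

```csharp
using System;
using System.Collections.Generic;

public static class BatchSlicer
{
    // Hypothetical helper: yields consecutive views over the array, no copying.
    public static IEnumerable<ArraySegment<long>> Slice(long[] ids, int batchSize)
    {
        if (ids == null) throw new ArgumentNullException(nameof(ids));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        // A long offset avoids int overflow when the array is very large.
        for (long offset = 0; offset < ids.Length; offset += batchSize)
        {
            int count = (int)Math.Min(batchSize, ids.Length - offset);
            yield return new ArraySegment<long>(ids, (int)offset, count);
        }
    }
}
```

The worker can then read `segment.Array[segment.Offset + i]` for `i` from 0 to `segment.Count - 1` and build the TableData objects itself, so the producing loop allocates nothing per item.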

Answer 2

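First, here is some advice from Microsoft about selecting rows randomly from a large table.

Second, if that doesn't help, read on...

If you know how many items you want to select at random, and how many items are in the sequence you are selecting from, you can use an O(N) solution.

In the example below, the method RandomlySelectedItems<T>() yields a sequence of randomly selected items.

Here is the code. (To reiterate, you can only use this if you know these counts in advance):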

using System;
using System.Collections.Generic;
using System.Linq;

namespace Demo
{
    internal static class Program
    {
        static void Main(string[] args)
        {
            int numberOfValuesToSelectFrom = 10000000;
            int numberOfValuesToSelect = 20;
            var valuesToSelectFrom = Enumerable.Range(1, numberOfValuesToSelectFrom);

            var selectedValues = RandomlySelectedItems
            (
                valuesToSelectFrom, 
                numberOfValuesToSelect, 
                numberOfValuesToSelectFrom, 
                new Random()
           );

            foreach (int value in selectedValues)
                Console.WriteLine(value);
        }

        /// <summary>Randomly selects items from a sequence.</summary>
        /// <typeparam name="T">The type of the items in the sequence.</typeparam>
        /// <param name="sequence">The sequence from which to randomly select items.</param>
        /// <param name="count">The number of items to randomly select from the sequence.</param>
        /// <param name="sequenceLength">The number of items in the sequence among which to randomly select.</param>
        /// <param name="rng">The random number generator to use.</param>
        /// <returns>A sequence of randomly selected items.</returns>
        /// <remarks>This is an O(N) algorithm (N is the sequence length).</remarks>

        public static IEnumerable<T> RandomlySelectedItems<T>(IEnumerable<T> sequence, int count, int sequenceLength, Random rng)
        {
            if (sequence == null)
                throw new ArgumentNullException("sequence");

            if (count < 0 || count > sequenceLength)
                throw new ArgumentOutOfRangeException("count", count, "count must be between 0 and sequenceLength");

            if (rng == null)
                throw new ArgumentNullException("rng");

            int available = sequenceLength;
            int remaining = count;
            var iterator  = sequence.GetEnumerator();

            for (int current = 0; current < sequenceLength; ++current)
            {
                iterator.MoveNext();

                if (rng.NextDouble() < remaining/(double)available)
                {
                    yield return iterator.Current;
                    --remaining;
                }

                --available;
            }
        }
    }
}

Answer 3

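One option is not to try to generate truly random or even pseudo-random numbers, but to use a sequence that merely looks random to a casual observer. This works in many cases, but it will not work if the items must be chosen randomly so that an attacker cannot predict the next value. The benefit is that you don't need to keep all the generated values in memory in order to shuffle them.

First, pick two random primes (a, b) smaller than the number of rows (r), such that a * b > r and a does not divide r. Because a is prime and does not divide r, a and r share no common factor, so the map f(x) = a * x + b mod r is guaranteed to be one-to-one in the ring of integers modulo r. We will use that fact to generate a sequence in which every value between 0 and r - 1 appears exactly once.

Let's pick two random primes, say a = 11268619 and b = 4064861. Then you can generate a "random" sequence of numbers in the range 0 to 1e9 - 1: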

private static IEnumerable<int> GenerateSequence()
{
    const int max = 1000000000;
    const long a = 11268619, b = 4064861;

    for(int i = 0; i < max; i++)
    {
        int c = (int)((a * i + b) % max);
        yield return c;
    }
}
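
The uniqueness claim can be checked by brute force on a small range. The sketch below is illustrative only: `AffineCheck` and `IsPermutation` are hypothetical names, and r = 1000 with a = 7 and b = 3 are values picked for the demonstration (7 is prime and does not divide 1000), not values from the answer. Non-negative a and b are assumed.

```csharp
using System;

public static class AffineCheck
{
    // Returns true if x -> (a * x + b) mod r hits every value in [0, r) exactly once.
    public static bool IsPermutation(long a, long b, int r)
    {
        var seen = new bool[r];
        for (int x = 0; x < r; x++)
        {
            int y = (int)((a * x + b) % r);
            if (seen[y]) return false; // collision: the map is not one-to-one
            seen[y] = true;
        }
        return true;
    }
}
```

`AffineCheck.IsPermutation(7, 3, 1000)` returns true, while `AffineCheck.IsPermutation(5, 3, 1000)` returns false, because 5 divides 1000 and the inputs 0 and 200 map to the same value.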

Performance issue with List
