Question

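I want to process some data. I have about 25k items in a Dictionary. In a foreach loop, I query the database to get the results for each item, and those results are added as the value in the Dictionary.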

foreach (KeyValuePair<string, Type> pair in allPeople)
{
    MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);
    MySqlDataReader reader2 = comd.ExecuteReader();
    Dictionary<string, Dictionary<int, Log>> allViews = new Dictionary<string, Dictionary<int, Log>>();
    while (reader2.Read())
    {
        if (!allViews.ContainsKey(reader2.GetString("src")))
        {
            allViews.Add(reader2.GetString("src"), reader2.GetInt32("time"));
        }
    }
    reader2.Close();
    reader2.Dispose();
    allPeople[pair.Key].View = allViews;
}

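I would like to make this faster with multithreading. I have 8 threads available, and CPU usage is around 13%. I just don't know whether it will work, because it depends on the MySQL server. On the other hand, maybe 8 threads could open 8 DB connections and so be faster.

Anyway, if multithreading would help in my case, how would I do it? o.O I've never worked with (multiple) threads, so any help would be great :D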

Answer 1

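MySqlDataReader is stateful: you call Read() on it and it moves to the next row, so each thread needs its own reader, and you need to write the queries so that each one gets different values. That shouldn't be too hard, since you naturally have many queries with different values of pair.Key.

You also need either a temporary dictionary per thread that you merge afterwards, or a lock to prevent concurrent modification of the dictionary.

The above assumes that MySQL allows a single connection to run concurrent queries; otherwise you may need multiple connections too.

First, though, I would see what happens if you only ask the database for the data you need ("SELECT src,time FROM logs WHERE IP = '" + pair.Key + "' GROUP BY src") and use GetString(0) and GetInt32(1) instead of looking up src and time by name; also fetch each value from the result only once.

I'm also not sure about the logic: you are not ordering the log events by time, so the one returned first (and therefore stored in the dictionary) could be any of them.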
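To make the "temporary dictionary per thread, then merge" option concrete, here is a minimal sketch. The class name `MergeSketch`, the method `Collect`, and the `lookup` delegate are hypothetical; `lookup` stands in for the database query so the sketch runs without MySQL, and the values are plain ints rather than the types from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class MergeSketch
{
    // Splits the keys across threadCount threads. Each thread fills a private
    // dictionary and merges it into the shared one under a lock when it is done.
    public static Dictionary<string, int> Collect(
        IList<string> keys, int threadCount, Func<string, int> lookup)
    {
        var merged = new Dictionary<string, int>();
        var gate = new object();
        var threads = new Thread[threadCount];

        for (int t = 0; t < threadCount; ++t)
        {
            int threadNumber = t; // copy, so the lambda does not capture the loop variable
            threads[t] = new Thread(() =>
            {
                var local = new Dictionary<string, int>();
                for (int i = threadNumber; i < keys.Count; i += threadCount)
                {
                    local[keys[i]] = lookup(keys[i]); // stand-in for the per-key query
                }

                lock (gate) // only the short merge step is serialized
                {
                    foreach (var pair in local)
                    {
                        merged[pair.Key] = pair.Value;
                    }
                }
            });
            threads[t].Start();
        }

        foreach (var thread in threads)
        {
            thread.Join();
        }
        return merged;
    }
}
```

The lock is taken once per thread rather than once per row, so contention stays low even with many keys.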

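Something like this for the logic: each of the N threads only works on every Nth pair, each thread has its own reader, and nothing actually changes allPeople itself, only the properties of the values in allPeople.

This is untested: I don't have MySQL on this machine, nor your database or the other types you're using. It's also rather procedural (roughly how you would do it in Fortran with OpenMPI) rather than wrapping everything up in task objects.

The per-thread worker looks like this: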

    private void RunSubQuery(Dictionary<string, Type> allPeople, MySqlConnection con, int threadNumber, int threadCount)
    {
        int hoppity = 0; // used to hop over the keys not processed by this thread

        foreach (var pair in allPeople)
        {
            // each of the (threadCount) threads only processes the (threadCount)th key
            if ((hoppity % threadCount) == threadNumber)
            {
                // you may need con per thread, or it might be that you can share con; I don't know
                MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);

                using (MySqlDataReader reader = comd.ExecuteReader())
                {
                    var allViews = new Dictionary<string, Dictionary<int, Log>>();

                    while (reader.Read())
                    {
                        string src = reader.GetString(0);
                        int time = reader.GetInt32(1);

                        // do whatever to allViews with src and time
                    }

                    // no thread will be modifying the same pair.Value, so this is safe
                    pair.Value.View = allViews;
                }
            }

            ++hoppity;
        }
    }

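Nothing in this guarantees any performance gain: it may be that the MySQL library is single-threaded, though the server can certainly handle multiple connections. Measure with different numbers of threads.

If you're using .NET 4, you don't have to mess around with creating the threads or skipping the items that aren't yours to process.

Otherwise, you would start the threads as below. The extra lock held on allPeople is there so that there is a write barrier after all the threads return; I'm not quite sure whether it's needed. Any object would do.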

    void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
    {
        lock (allPeople)
        {
            const int threadCount = 8; // the number of threads

            // if it takes 18 seconds currently and you're not at .net 4 yet, then you may as well create
            // the threads here as any saving of using a pool will not matter against 18 seconds
            //
            // it could be more efficient to use a pool so that each thread takes a pair off of 
            // a queue, as doing it this way means that each thread has the same number of pairs to process,
            // and some pairs might take longer than others
            Thread[] threads = new Thread[threadCount];

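            for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
            {
                // copy the loop variable: the lambda would otherwise capture the shared,
                // still-changing threadNumber, and a thread could see the wrong value
                int number = threadNumber;
                threads[threadNumber] = new Thread(new ThreadStart(() => RunSubQuery(allPeople, connection, number, threadCount)));
                threads[threadNumber].Start();
            }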

            // wait for all threads to finish
            for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
            {
                threads[threadNumber].Join();
            }
        }
    }
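For the .NET 4 route mentioned above, a minimal sketch using Parallel.ForEach from the Task Parallel Library could look like the following. The `Person` class, the `FillViews` method, and the `loadViews` delegate are assumptions made for illustration; `loadViews` stands in for "open a connection, run the query for this IP, build the dictionary", and in real code each invocation would probably need its own connection and reader.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the value type stored in allPeople.
public class Person
{
    public Dictionary<string, int> View { get; set; }
}

public static class ParallelSketch
{
    public static void FillViews(
        Dictionary<string, Person> allPeople,
        Func<string, Dictionary<string, int>> loadViews)
    {
        // Cap the parallelism so the database is not flooded with connections.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

        Parallel.ForEach(allPeople, options, pair =>
        {
            // Only a property of the value is written; the dictionary itself
            // is never modified, so no lock is taken here.
            pair.Value.View = loadViews(pair.Key);
        });
    }
}
```

Parallel.ForEach hands out items dynamically, so a slow key does not hold up a fixed share of the work the way the every-Nth-pair split can.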

Answer 2

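The biggest problem that comes to mind is that you would be adding values to the Dictionary from multiple threads, and Dictionary is not thread-safe.

You would have to synchronize access to make it work, and you might not gain much from implementing it, since it still has to lock the dictionary object to add a value.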
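One way to avoid writing the locking by hand, assuming .NET 4 or later, is ConcurrentDictionary. This is only a sketch: `ConcurrentSketch` and `FirstTimePerSource` are hypothetical names, and the rows are passed in as key/value pairs instead of being read from MySQL.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ConcurrentSketch
{
    // Stores one time per src, like the ContainsKey check in the question.
    // When several rows share a src, which one is stored is not deterministic,
    // because the rows are processed in parallel.
    public static ConcurrentDictionary<string, int> FirstTimePerSource(
        IEnumerable<KeyValuePair<string, int>> rows)
    {
        var views = new ConcurrentDictionary<string, int>();

        Parallel.ForEach(rows, row =>
        {
            // TryAdd is atomic and returns false when the key already exists.
            views.TryAdd(row.Key, row.Value);
        });

        return views;
    }
}
```

The collection still synchronizes internally, so this removes the risk of corruption rather than the cost of coordination.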

Answer 3

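Before you do anything else, find out exactly where the time is being spent. Check the execution plan of the query. The first thing I would suspect is a missing index on logs.IP.

For something like this, 18 minutes seems far too long to me. Even if you could cut the execution time by a factor of 8 by adding more threads (which is unlikely!), you would still be at more than 2 minutes. You could probably read the whole 25k rows into memory in less than five seconds and do the necessary processing in memory...

EDIT: Just to clarify, I'm not advocating actually doing this in memory, just saying that it looks like there is a bigger bottleneck here that can be removed.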

Answer 4

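Assumptions:

There is a table people in your database
There are a lot of people in your database

Each database query adds overhead, and you are running one database query for every person in your database. I suggest it would be faster to get all the data in a single query than to make repeated calls: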

select l.ip,l.time,l.src 
  from logs l, people p 
  where l.ip = p.ip
  group by l.ip, l.src

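Try looping over the result of this in a single thread; I believe it will be much faster than your existing code.

Another thing you can do within your existing code is to move the MySqlCommand creation out of the loop, prepare it in advance, and just change the parameter. This should speed up execution of the SQL. See http://dev.mysql.com/doc/refman/5.0/es/connector-net-examples-mysqlcommand.html#connector-net-examples-mysqlcommand-prepare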
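As a rough sketch of the single-threaded loop over that one result set, the rows can be grouped in memory as they are read. The names `GroupSketch` and `GroupByIp` are hypothetical, and the rows are passed in as (ip, src, time) tuples rather than coming from a MySqlDataReader, so the sketch runs on its own (it needs C# 7 for the tuple syntax).

```csharp
using System.Collections.Generic;

public static class GroupSketch
{
    // rows: (ip, src, time) values as the joined query would return them.
    public static Dictionary<string, Dictionary<string, int>> GroupByIp(
        IEnumerable<(string Ip, string Src, int Time)> rows)
    {
        var byIp = new Dictionary<string, Dictionary<string, int>>();

        foreach (var row in rows)
        {
            // Find or create the per-IP dictionary.
            if (!byIp.TryGetValue(row.Ip, out var views))
            {
                views = new Dictionary<string, int>();
                byIp[row.Ip] = views;
            }

            // Keep the first time seen for each src, as the original loop did.
            if (!views.ContainsKey(row.Src))
            {
                views.Add(row.Src, row.Time);
            }
        }

        return byIp;
    }
}
```

Each entry of the result can then be assigned to the matching person's View in one pass, with a single round trip to the server.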

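MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = ?key GROUP BY src", con);
comd.Prepare();
comd.Parameters.AddWithValue("?key", "example");
foreach (KeyValuePair<string, Type> pair in allPeople)
{
    // reuse the prepared command; only the parameter value changes
    comd.Parameters[0].Value = pair.Key;
    // execute comd and read the results as before
}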

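If you use multiple threads, each thread will still need its own command. At least in MS-SQL, this would still be faster even if you recreated and prepared the statement every time, because the SQL server is able to cache the execution plan for a parameterized statement.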

Answer 5

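I think that if you're running this on a multi-core machine, you could gain some benefit from multithreading.

However, the way I would approach it is to first look at unblocking the thread you are currently using by making asynchronous database calls. The callbacks will execute on background threads, so you will get some multi-core benefit there, and you won't be blocking threads while waiting for the database to come back.

For an IO-intensive application, which is what this example sounds like, you will likely see improved throughput, depending on what load the database can handle. Assuming the db scales to handle more than one concurrent request, you should be good.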
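The asynchronous approach can be sketched with tasks. Note that this swaps the callback style described above for async/await, which needs C# 5 and .NET 4.5 or later. `AsyncSketch`, `QueryAllAsync`, and the `queryAsync` delegate are hypothetical; `queryAsync` stands in for an asynchronous database call, and `maxConcurrent` caps how many requests are in flight at once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncSketch
{
    public static async Task<Dictionary<string, int>> QueryAllAsync(
        IEnumerable<string> keys, Func<string, Task<int>> queryAsync, int maxConcurrent)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = keys.Select(async key =>
            {
                await throttle.WaitAsync(); // wait for a free slot without blocking a thread
                try
                {
                    return new KeyValuePair<string, int>(key, await queryAsync(key));
                }
                finally
                {
                    throttle.Release();
                }
            }).ToList();

            // The dictionary is built only after every query has finished,
            // so it is never written from two threads at once.
            var results = await Task.WhenAll(tasks);
            return results.ToDictionary(r => r.Key, r => r.Value);
        }
    }
}
```

Whether this helps depends on how many concurrent requests the database can serve, so the throttle value is something to measure rather than guess.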

Answer 6

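Thanks everyone for your help. Currently I'm using the ThreadPool like this to run all the threads:

for (int i = 0; i < 8; i++)
{
    ThreadPool.QueueUserWorkItem(addDistinctScres, i);
}

I use the method Pete Kirkham provided, and I create a new connection for each thread. The time dropped to 4 minutes.

Next I will add some kind of callback to wait for the thread pool before other functions are executed.

I think the bottleneck now is the MySQL server, because CPU usage has dropped.

@odd parity I thought about that, but the real thing is a lot more than 25k rows, so I'm not sure that approach would hold up.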
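For the "wait for the thread pool before executing other functions" step, one option on .NET 4 is a CountdownEvent. This is a sketch under assumptions: `PoolWaitSketch` and `RunAndWait` are hypothetical names, and `work` stands in for a method like addDistinctScres that takes the worker index.

```csharp
using System;
using System.Threading;

public static class PoolWaitSketch
{
    // Queues workerCount items on the thread pool and blocks until all have finished.
    public static void RunAndWait(int workerCount, Action<int> work)
    {
        using (var done = new CountdownEvent(workerCount))
        {
            for (int i = 0; i < workerCount; i++)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        work((int)state); // the index arrives through the state argument
                    }
                    finally
                    {
                        done.Signal(); // count down even if the work throws
                    }
                }, i);
            }

            done.Wait(); // returns once every worker has signalled
        }
    }
}
```

Passing `i` as the state object avoids capturing the loop variable in the lambda, so each worker sees its own index.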

Answer 7

This sounds like the perfect job for map/reduce. I'm not a .NET programmer, but this looks like a reasonable guide: http://ox.no/posts/minimalistic-mapreduce-in-net-4-0-with-the-new-task-parallel-library-tpl

Multithreading on a foreach loop?
