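The use case is: I have a huge log file that I read in chunks (equal size, plain IO reads) on the main thread. On my test machine, each chunk takes about 1 second to read. After reading each chunk, I use the thread pool to create a thread per chunk, which inserts it into one of 2 database instances. Now I have two challenges:
I have to insert the chunks into 2 DBs, i.e. odd chunks go to the first DB and even chunks go to the second DB. I don't have anything in the chunk model that represents the chunk number that I can rely on. I tried creating a wrapper around the chunk model to carry a "chunkCount", but where do I increment chunkCount?
How do I measure the time taken by each insert when the inserts run on different threads in the thread pool?
I tried the following code as an experiment, but it did not produce any useful results: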
logEventsChunk = logFetcher.GetNextLogEventsChunk();
chunkModel = new LogEventChunkModel();
stw = new Stopwatch();
chunkModel.ChunkCount = chunkCount;
chunkModel.LogeventChunk = logEventsChunk;
//chunkCount++;
ThreadPool.QueueUserWorkItem(new WaitCallback(delegate(object state)
{ InsertChunk(chunkModel, collection, secondCollection, stw); }), null);
The InsertChunk method is here:
private void InsertChunk(LogEventChunkModel logEventsChunk, MongoCollection<LogEvent> collection, MongoCollection<LogEvent> secondCollection, Stopwatch stw)
{
    chunkCount++;
    stw.Start();
    MongoInsertOptions options = new MongoInsertOptions();
    options.WriteConcern = WriteConcern.Unacknowledged;
    options.CheckElementNames = true;
    string db = string.Empty;
    {
        //DateTime dtWrite = DateTime.Now;
        if (logEventsChunk.ChunkCount % 2 == 0)
        {
            DateTime dtWrite1 = DateTime.Now;
            collection.InsertBatch(logEventsChunk.LogeventChunk.LogEvents, options);
            db = "FirstDB";
            //Console.WriteLine("Time taken to write the chunk: " + DateTime.Now.Subtract(dtWrite1).TotalSeconds.ToString() + " s. " + db);
        }
        else
        {
            DateTime dtWrite2 = DateTime.Now;
            secondCollection.InsertBatch(logEventsChunk.LogeventChunk.LogEvents, options);
            db = "SecondDB";
            //Console.WriteLine("Time taken to write the chunk: " + DateTime.Now.Subtract(dtWrite2).TotalSeconds.ToString() + " s. " + db);
        }
        Console.WriteLine("Thread Completed: {0} **********", Thread.CurrentThread.GetHashCode());
        stw.Stop();
        Console.WriteLine("Time taken to write the chunk: " + stw.ElapsedMilliseconds + " ms. " + db + " Chunk Count: " + logEventsChunk.ChunkCount);
        stw.Reset();
        //+ "Chunk Count: " + chunkCount.ToString()
        //Console.WriteLine("Time taken to write the chunk: " + DateTime.Now.Subtract(dtWrite).TotalSeconds.ToString() + " s. "+db);
        //mongoDBInsertionTotalTime += DateTime.Now.Subtract(dtWrite).TotalSeconds;
    }
}
Please ignore the commented-out lines; they are just leftovers from some experiments.
Answer 0 (score: 1)
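Rather than starting a new thread for every insert and having each thread work out which database to write to, start two persistent threads, each of which writes to one database. These threads get their data from queues. This is a very standard producer/consumer setup using BlockingCollection<T>.
So, you have: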
// Maximum number of items in queue (to avoid out of memory errors)
const int MaxQueueSize = 10000;
BlockingCollection<LogEventChunkModel> Db1Queue = new BlockingCollection<LogEventChunkModel>(MaxQueueSize);
BlockingCollection<LogEventChunkModel> Db2Queue = new BlockingCollection<LogEventChunkModel>(MaxQueueSize);
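To illustrate what the capacity argument does, here is a minimal sketch, assuming a tiny queue of int instead of LogEventChunkModel. The class and method names (BoundedQueueDemo, CountAccepted) are made up for this example. It uses TryAdd, which returns false when the queue is full, where Add would block the producer until a consumer removes an item:

```csharp
using System;
using System.Collections.Concurrent;

static class BoundedQueueDemo
{
    // Tries to add `attempts` items to a queue bounded at `capacity`
    // and returns how many were accepted without blocking.
    public static int CountAccepted(int capacity, int attempts)
    {
        using (var queue = new BlockingCollection<int>(capacity))
        {
            int accepted = 0;
            for (int i = 0; i < attempts; i++)
            {
                // TryAdd returns false immediately when the queue is full;
                // Add would block here until a consumer takes an item.
                if (queue.TryAdd(i))
                    accepted++;
            }
            return accepted;
        }
    }

    static void Main()
    {
        Console.WriteLine(CountAccepted(2, 3)); // prints 2
    }
}
```

In the real pipeline the blocking behaviour of Add is usually what you want: if the databases fall behind, the reader simply pauses instead of filling memory with unwritten chunks.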
In the main thread, start the database update threads:
var t1 = new Thread(DbWriteThreadProc);
t1.Start(new Tuple<string, BlockingCollection<LogEventChunkModel>>("FirstDB", Db1Queue));
var t2 = new Thread(DbWriteThreadProc);
t2.Start(new Tuple<string, BlockingCollection<LogEventChunkModel>>("SecondDB", Db2Queue));
Then, start reading the log file and put alternating chunks into the two queues:
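// Running chunk number; only the main (reader) thread touches it
int chunkNumber = 0;
while (!EndOfLogFile)
{
    var chunk = GetNextChunk();
    // Chunks 0, 2, 4, ... go to the first DB; chunks 1, 3, 5, ... to the second
    if ((chunkNumber % 2) == 0)
        Db1Queue.Add(chunk);
    else
        Db2Queue.Add(chunk);
    ++chunkNumber;
}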
// end of data, so mark the queues as complete
Db1Queue.CompleteAdding();
Db2Queue.CompleteAdding();
// and wait for threads to complete processing the queues
t1.Join();
t2.Join();
Your writer thread proc is very simple. It just services the queue and writes to the database:
void DbWriteThreadProc(object state)
{
    // passed object is a Tuple<string, BlockingCollection<LogEventChunkModel>>
    // Get the items from it
    var threadData = (Tuple<string, BlockingCollection<LogEventChunkModel>>)state;
    string dbName = threadData.Item1;
    BlockingCollection<LogEventChunkModel> queue = threadData.Item2;
    // now read the queue and write to the database
    foreach (var chunk in queue.GetConsumingEnumerable())
    {
        // one Stopwatch per write, local to this thread
        var sw = Stopwatch.StartNew();
        // write chunk to the database.
        sw.Stop();
        Console.WriteLine("{0}: time to write = {1:N0} ms", dbName, sw.ElapsedMilliseconds);
    }
}
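GetConsumingEnumerable does a non-busy wait on the queue, so the thread does not keep polling. The loop finishes when the queue is empty and has been marked as complete for adding (which is why the main thread calls CompleteAdding).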
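The termination rule can be checked with a small stand-alone sketch, assuming a queue of int and a single consumer thread. ConsumingDemo and ProduceAndConsume are names invented for this example. The consumer's foreach only returns after the producer has called CompleteAdding and the queue has drained:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

static class ConsumingDemo
{
    // Produces itemCount integers on the calling thread, consumes them
    // on a second thread, and returns them in the order they were consumed.
    public static List<int> ProduceAndConsume(int itemCount)
    {
        var consumed = new List<int>();
        using (var queue = new BlockingCollection<int>())
        {
            var consumer = new Thread(() =>
            {
                // Blocks while the queue is empty; exits once the queue
                // is empty AND CompleteAdding has been called.
                foreach (int item in queue.GetConsumingEnumerable())
                    consumed.Add(item);
            });
            consumer.Start();

            for (int i = 0; i < itemCount; i++)
                queue.Add(i);

            queue.CompleteAdding(); // without this, Join below would never return
            consumer.Join();
        }
        return consumed;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", ProduceAndConsume(5))); // prints 0,1,2,3,4
    }
}
```

The items come out in insertion order because a BlockingCollection<T> is backed by a ConcurrentQueue<T> by default and there is only one consumer.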
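This approach has several advantages over yours. In particular, it makes it trivial to decide which database each chunk is written to. It also uses at most three threads, and it guarantees that chunks are added to each database in the same order in which they were read from the log file. Your approach with QueueUserWorkItem cannot guarantee the insertion order. It also queues a separate thread-pool work item for every insert, so you could end up with a large number of inserts running concurrently.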
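To see the routing, ordering and per-insert timing together, here is a self-contained sketch under simplifying assumptions: chunks are plain int chunk numbers, and the database write is simulated with Thread.Sleep instead of a real InsertBatch call. TwoWriterSketch, Writer and Run are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

static class TwoWriterSketch
{
    // Drains one queue; each simulated write is timed with its own Stopwatch.
    static void Writer(string dbName, BlockingCollection<int> queue, List<int> written)
    {
        foreach (int chunkNumber in queue.GetConsumingEnumerable())
        {
            var sw = Stopwatch.StartNew();
            Thread.Sleep(5);            // stand-in for the real database insert
            written.Add(chunkNumber);
            sw.Stop();
            Console.WriteLine("{0}: chunk {1} took {2:N0} ms",
                dbName, chunkNumber, sw.ElapsedMilliseconds);
        }
    }

    // Routes chunk numbers 0..chunkCount-1 alternately to two writers and
    // returns the chunk numbers each writer handled, in write order.
    public static Tuple<List<int>, List<int>> Run(int chunkCount)
    {
        var first = new List<int>();
        var second = new List<int>();
        using (var q1 = new BlockingCollection<int>(100))
        using (var q2 = new BlockingCollection<int>(100))
        {
            var t1 = new Thread(() => Writer("FirstDB", q1, first));
            var t2 = new Thread(() => Writer("SecondDB", q2, second));
            t1.Start();
            t2.Start();

            // The counter lives only on the producer thread, so no locking is needed.
            for (int n = 0; n < chunkCount; n++)
            {
                if (n % 2 == 0) q1.Add(n); else q2.Add(n);
            }

            q1.CompleteAdding();
            q2.CompleteAdding();
            t1.Join();
            t2.Join();
        }
        return Tuple.Create(first, second);
    }

    static void Main()
    {
        var result = Run(6);
        Console.WriteLine(string.Join(",", result.Item1)); // prints 0,2,4
        Console.WriteLine(string.Join(",", result.Item2)); // prints 1,3,5
    }
}
```

The timing lines from the two writers may interleave on the console, but within each database the chunk numbers always appear in the order they were queued.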