我一直在阅读很多关于线程的内容,但无法弄清楚如何找到我的问题的解决方案。 首先让我介绍一下这个问题。我有需要处理的文件。主机名和文件路径位于两个数组中。
现在我想设置几个线程来处理文件。要创建的线程数基于三个因素:
A)最大线程数不能超过所有方案中唯一主机名的数量
B)顺序处理具有相同主机名必须的文件。 I.E我们无法同时处理 host1 _file1和 host1 _file2。 (数据完整性将受到威胁,这超出了我的控制范围
C)用户可以限制可用于处理的线程数。线程数仍受上述条件A的限制。这纯粹是因为如果我们有大量的主机让我们说50 ..我们可能不希望同时处理50个线程。
在上面的示例中,最多可以创建6个线程。
最佳处理程序如下所示。
public class file_prep_obj
{
public string[] file_paths;
public string[] hostname;
public Dictionary<string, int> my_dictionary;
public void get_files()
{
hostname = new string[]{ "host1", "host1", "host1", "host2", "host2", "host3", "host4","host4","host5","host6" };
file_paths=new string[]{"C:\\host1_file1","C:\\host1_file2","C:\\host1_file3","C:\\host2_file1","C:\\host2_file2","C:\\host2_file2",
"C:\\host3_file1","C:\\host4_file1","C:\\host4_file2","C:\\host5_file1","C:\\host6_file1"};
//The dictionary provides a count on the number of files that need to be processed for a particular host.
my_dictionary = hostname.GroupBy(x => x)
.ToDictionary(g => g.Key,
g => g.Count());
}
}
//This class contains a list of file_paths associated with the same host.
//The group_file_host_name will be the same for a host.
class host_file_thread
{
public string[] group_file_paths;
public string[] group_file_host_name;
public void process_file(string file_path_in)
{
var time_delay_random=new Random();
Console.WriteLine("Started processing File: " + file_path_in);
Task.Delay(time_delay_random.Next(3000)+1000);
Console.WriteLine("Completed processing File: " + file_path_in);
}
}
class Program
{
static void Main(string[] args)
{
file_prep_obj my_files=new file_prep_obj();
my_files.get_files();
//Create our host objects... my_files.my_dictionary.Count represents the max number of threads
host_file_thread[] host_thread=new host_file_thread[my_files.my_dictionary.Count];
int key_pair_count=0;
int file_path_position=0;
foreach (KeyValuePair<string, int> pair in my_files.my_dictionary)
{
host_thread[key_pair_count] = new host_file_thread(); //Initialise the host_file_thread object. Because we have an array of a customised object
host_thread[key_pair_count].group_file_paths=new string[pair.Value]; //Initialise the group_file_paths
host_thread[key_pair_count].group_file_host_name=new string[pair.Value]; //Initialise the group_file_host_name
for(int j=0;j<pair.Value;j++)
{
host_thread[key_pair_count].group_file_host_name[j]=pair.Key.ToString(); //Group the hosts
host_thread[key_pair_count].group_file_paths[j]=my_files.file_paths[file_path_position]; //Group the file_paths
file_path_position++;
}
key_pair_count++;
}//Close foreach (KeyValuePair<string, int> pair in my_files.my_dictionary)
//TODO PROCESS FILES USING host_thread objects.
}//Close static void Main(string[] args)
}//Close Class Program
我想我所追求的是如何编写符合上述规范的线程处理例程的指南。
答案 0 :(得分:2)
您可以使用Stephen Toub的ForEachAsync扩展方法来处理文件。它允许您指定要使用的并发线程数,并且它是非阻塞的,因此它可以释放主线程以执行其他处理。这是文章中的方法:
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate
{
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
为了使用它,我稍微重构了你的代码。我将字典更改为Dictionary<string, List<string>>
类型,它基本上将主机作为键,然后将所有路径作为值。我假设文件路径中包含主机名。
my_dictionary = (from h in hostname
from f in file_paths
where f.Contains(h)
select new { Hostname = h, File = f }).GroupBy(x => x.Hostname)
.ToDictionary(x => x.Key, x => x.Select(s => s.File).Distinct().ToList());
我还将您的process_file
方法更改为async
,因为您在Task.Delay
内使用了await
,您需要public static async Task process_file(string file_path_in)
{
var time_delay_random = new Random();
Console.WriteLine("Started:{0} ThreadId:{1}", file_path_in, Thread.CurrentThread.ManagedThreadId);
await Task.Delay(time_delay_random.Next(3000) + 1000);
Console.WriteLine("Completed:{0} ThreadId:{1}", file_path_in, Thread.CurrentThread.ManagedThreadId);
}
,否则它无法执行任何操作。
my_files.my_dictionary.ForEachAsync
要使用该代码,您将获得要使用的最大线程数,并将其传递给public static async Task MainAsync()
{
var my_files = new file_prep_obj();
my_files.get_files();
const int userSuppliedMaxThread = 5;
var maxThreads = Math.Min(userSuppliedMaxThread, my_files.my_dictionary.Values.Count());
Console.WriteLine("MaxThreads = " + maxThreads);
foreach (var pair in my_files.my_dictionary)
{
foreach (var path in pair.Value)
{
Console.WriteLine("Key= {0}, Value={1}", pair.Key, path);
}
}
await my_files.my_dictionary.ForEachAsync(maxThreads, async (pair) =>
{
foreach (var path in pair.Value)
{
// serially process each path for a particular host.
await process_file(path);
}
});
}
static void Main(string[] args)
{
MainAsync().Wait();
Console.ReadKey();
}//Close static void Main(string[] args)
。您还提供了一个异步委托,它处理特定主机的每个文件,并按顺序等待每个文件进行处理。
MaxThreads = 5
Key= host1, Value=C:\host1_file1
Key= host1, Value=C:\host1_file2
Key= host1, Value=C:\host1_file3
Key= host2, Value=C:\host2_file1
Key= host2, Value=C:\host2_file2
Key= host3, Value=C:\host3_file1
Key= host4, Value=C:\host4_file1
Key= host4, Value=C:\host4_file2
Key= host5, Value=C:\host5_file1
Key= host6, Value=C:\host6_file1
Started:C:\host1_file1 ThreadId:10
Started:C:\host2_file1 ThreadId:12
Started:C:\host3_file1 ThreadId:13
Started:C:\host4_file1 ThreadId:11
Started:C:\host5_file1 ThreadId:10
Completed:C:\host1_file1 ThreadId:13
Completed:C:\host2_file1 ThreadId:12
Started:C:\host1_file2 ThreadId:13
Started:C:\host2_file2 ThreadId:12
Completed:C:\host2_file2 ThreadId:11
Completed:C:\host1_file2 ThreadId:13
Started:C:\host6_file1 ThreadId:11
Started:C:\host1_file3 ThreadId:13
Completed:C:\host5_file1 ThreadId:11
Completed:C:\host4_file1 ThreadId:12
Completed:C:\host3_file1 ThreadId:13
Started:C:\host4_file2 ThreadId:12
Completed:C:\host1_file3 ThreadId:11
Completed:C:\host6_file1 ThreadId:13
Completed:C:\host4_file2 ThreadId:12
<强>输出继电器强>
{{1}}
答案 1 :(得分:1)
我正在解决你的问题并想出了以下方法。它可能不是最好的,但我相信它符合您的需求。
在开始之前,我是扩展方法的忠实粉丝,所以这里有一个:
public static class IEnumerableExtensions
{
public static void Each<T>(this IEnumerable<T> ie, Action<T, int> action)
{
var i = 0;
foreach (var e in ie) action(e, i++);
}
}
这样做是循环一个集合(foreach),但保留项目和索引。你会明白为什么以后需要这样做。
然后我们有变量。
public static string[] group_file_paths =
{
"host1", "host1", "host1", "host2", "host2", "host3", "host4", "host4",
"host5", "host6"
};
public static string[] group_file_host_name =
{
@"c:\\host1_file1", @"c:\\host1_file2", @"c:\\host1_file3", @"c:\\host2_file1", @"c:\\host2_file2", @"c:\\host3_file1",
@"c:\\host4_file1", @"c:\\host4_file2", @"c:\\host5_file1", @"c:\\host5_file2", @"c:\\host6_file1"
};
然后是主要代码:
public static void Main(string[] args)
{
Dictionary<string, List<string>> filesToProcess = new Dictionary<string, List<string>>();
// Loop over the 2 arrays and creates a directory that contains the host as the key, and then all the filenames.
group_file_paths.Each((host, hostIndex) =>
{
if (filesToProcess.ContainsKey(host))
{ filesToProcess[host].Add(group_file_host_name[hostIndex]); }
else
{
filesToProcess.Add(host, new List<string>());
filesToProcess[host].Add(group_file_host_name[hostIndex]);
}
});
var tasks = new List<Task>();
foreach (var kvp in filesToProcess)
{
tasks.Add(Task.Factory.StartNew(() =>
{
foreach (var file in kvp.Value)
{
process_file(kvp.Key, file);
}
}));
}
var handleTaskCompletionTask = Task.WhenAll(tasks);
handleTaskCompletionTask.Wait();
}
这里可能需要一些解释:
所以我创建了一个包含主机作为键的字典,以及一个需要处理的文件列表的值。
您的词典将如下所示:
之后,我创建了一系列将使用TPL执行的任务。 我现在执行所有任务,并且等待所有任务完成。
您的流程方法如下所示,仅用于测试目的:
public static void process_file(string host, string file)
{
var time_delay_random = new Random();
Console.WriteLine("Host '{0}' - Started processing the file {1}.", host, file);
Thread.Sleep(time_delay_random.Next(3000) + 1000);
Console.WriteLine("Host '{0}' - Completed processing the file {1}.", host, file);
Console.WriteLine("");
}
这篇文章不包括自己设置线程的方法,但可以通过在任务上使用完成处理程序轻松实现。比任何任务完成时,您可以再次循环收集并开始一项尚未完成的新任务。
所以,我希望它有所帮助。
答案 2 :(得分:1)
我首先要更好地组织您的数据结构。拥有两个独立的数组不仅会增加数据重复,还会产生隐式耦合,这对于查看代码的人来说可能并不明显。
一个包含单个任务信息的类可能类似于:
public class TaskInfo
{
private readonly string _hostName;
public string HostName
{
get { return _hostName; }
}
private readonly ReadOnlyCollection<string> _files;
public ReadOnlyCollection<string> Files
{
get { return _files; }
}
public TaskInfo(string host, IEnumerable<string> files)
{
_hostName = host;
_files = new ReadOnlyCollection<string>(files.ToList());
}
}
创建任务列表现在更加简单:
var list = new List<TaskInfo>()
{
new TaskInfo(
host: "host1",
files: new[] { @"c:\host1\file1.txt", @"c:\host1\file2.txt" }),
new TaskInfo(
host: "host2",
files: new[] { @"c:\host2\file1.txt", @"c:\host2\file2.txt" })
/* ... */
};
现在您已准备好任务,您可以简单地使用System.Threading.Tasks
命名空间中的各种类来并行调用它们。如果您真的想限制并发任务的数量,可以使用MaxDegreeOfParallelism属性:
Parallel.ForEach(
list,
new ParallelOptions() { MaxDegreeOfParallelism = 10 },
taskInfo => Process(taskInfo)
);
如果你想创建自己的线程池,你也可以使用带有多个消费者线程的ConcurrentQueue
来实现类似的事情,可能在等待WaitHandle
列表以了解它们何时和#39;重新完成。
答案 3 :(得分:0)
我认为ThreadPool
是您的完美解决方案。它将自己处理线程并排队工作。此外,您可以设置最大线程限制,即使您有超过最大线程数,它仍会对您的工作进行排队。
ThreadPool.SetMaxThreads([YourMaxThreads],[YourMaxThreads]);
foreach (var t in host_thread)
{
ThreadPool.QueueUserWorkItem(Foo, t);
}
private static void Foo(object thread)
{
foreach (var file in (thread as host_file_thread).group_file_paths)
{
(thread as host_file_thread).process_file(file);
}
}
虽然我建议您更改数据结构并保留process_file
方法