Processing a large number of tasks concurrently and asynchronously

Time: 2016-12-12 05:34:45

Tags: c#

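I want to process a list of 50,000 URLs through a web service. The provider of this service allows 5 URLs per second.

I need to process these URLs while adhering to the provider's rules.

This is my current code: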

static void Main(string[] args)
{
    process_urls().GetAwaiter().GetResult();

}
public static async Task process_urls()
{
    // let's say there is a list of 50,000+ URLs
    var urls = System.IO.File.ReadAllLines("urls.txt");

    var allTasks = new List<Task>();
    var throttler = new SemaphoreSlim(initialCount: 5);

    foreach (var url in urls)
    {
        await throttler.WaitAsync();

        allTasks.Add(
            Task.Run(async () =>
            {
                try
                {
                    Console.WriteLine(String.Format("Starting {0}", url));
                    var client = new HttpClient();
                    var xml = await client.GetStringAsync(url);
                    //do some processing on xml output
                    client.Dispose();
                }
                finally
                {
                    throttler.Release();
                }
            }));
    }   
    await Task.WhenAll(allTasks);   
}

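Instead of var client = new HttpClient(); I will create a new object of my target web service; HttpClient is used here only to keep the code generic.

Is this the correct approach for handling a large number of connections? And is there any way to limit the number of connections established per second to 5, given that the current implementation does not take any time frame into account?

Thanks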

1 Answer:

Answer 0: (score: 2)

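Reading values from a web service is an IO operation, which can be done asynchronously without multithreading. A thread would do nothing in this case except wait for the response, so running the requests in parallel with Task.Run just wastes resources.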
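As a minimal illustration of that point, the sketch below starts several IO-like operations from a single loop without Task.Run. Task.Delay stands in for the web request, and the names AsyncWithoutThreads, FakeRequestAsync and RunAllAsync are hypothetical, chosen only for this example.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class AsyncWithoutThreads
{
    // Stand-in for an IO-bound call such as HttpClient.GetStringAsync (assumption)
    public static async Task<int> FakeRequestAsync(int id, int delayMs)
    {
        await Task.Delay(delayMs); // no thread is blocked while waiting
        return id;
    }

    // Starts all requests first, then awaits them together
    public static async Task<int[]> RunAllAsync(int count, int delayMs)
    {
        var tasks = Enumerable.Range(0, count)
                              .Select(i => FakeRequestAsync(i, delayMs))
                              .ToList(); // ToList forces every request to start now
        return await Task.WhenAll(tasks);
    }
}
```

Because the waits overlap, five simulated requests of 200 ms each should complete in roughly 200 ms in total rather than one second.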

public static async Task process_urls()
{
    var urls = System.IO.File.ReadAllLines("urls.txt");

    foreach (var urlGroup in SplitToGroupsOfFive(urls))
    {
        // Start up to 5 requests at once; IO-bound work needs no extra threads
        var tasks = new List<Task>();
        foreach (var url in urlGroup)
        {
            tasks.Add(ProcessUrl(url));
        }
        // This delay ensures the next 5 URLs are started no sooner than 1 second later
        tasks.Add(Task.Delay(1000));

        await Task.WhenAll(tasks);
    }
}

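// Shared instance: HttpClient is intended to be reused across requests
private static readonly HttpClient client = new HttpClient();

// static, so that it can be called from the static process_urls method
private static async Task ProcessUrl(string url)
{
    var xml = await client.GetStringAsync(url);
    //do some processing on xml output
}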

// Requires using System.Linq for Take
private static IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
    const int GROUP_SIZE = 5;
    string[] group = null;
    int count = 0;

    foreach (var url in urls)
    {
        if (group == null)
            group = new string[GROUP_SIZE];

        group[count] = url;
        count++;

        if (count < GROUP_SIZE)
            continue;

        yield return group;

        group = null;
        count = 0;
    }

    // Return the final, partially filled group without its unused null slots
    if (group != null && count > 0)
    {
        yield return group.Take(count);
    }
}
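The grouping step is easy to get wrong at the tail end, so here is a self-contained sketch of the same idea that can be checked in isolation. Batching and InBatches are hypothetical names, and the generic signature is an assumption made for the example, not part of the code above.

```csharp
using System.Collections.Generic;

public static class Batching
{
    // Splits 'source' into consecutive lists of at most 'size' items
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }
        if (batch.Count > 0)
            yield return batch; // final, partially filled batch
    }
}
```

For 12 URLs and a size of 5 this yields groups of 5, 5 and 2 items. On .NET 6 or later, the built-in Enumerable.Chunk method should cover the same need.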

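Because you mentioned that the "processing" of the response is also an IO operation, the async/await approach is the most efficient one: no thread is blocked while a request is in flight, so other tasks can proceed while earlier ones are waiting for a response from the web service or for a file write to complete.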
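If waiting for a whole group to finish before starting the next one is too conservative, the SemaphoreSlim from the question and a one-second delay can be combined: each slot is held until both the request and the delay have completed, so no slot can start more than one request per window. The following is only a sketch under that assumption, not a production rate limiter. ThrottledRunner, RunAsync and RunOneAsync are hypothetical names, and Task.Delay timing is approximate.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRunner
{
    // Runs 'work' for every item, starting at most 'perWindow' items per 'window'
    public static async Task RunAsync<T>(
        IEnumerable<T> items, Func<T, Task> work, int perWindow, TimeSpan window)
    {
        using (var throttler = new SemaphoreSlim(perWindow))
        {
            var all = new List<Task>();
            foreach (var item in items)
            {
                await throttler.WaitAsync(); // wait for a free slot
                all.Add(RunOneAsync(item, work, throttler, window));
            }
            await Task.WhenAll(all);
        }
    }

    private static async Task RunOneAsync<T>(
        T item, Func<T, Task> work, SemaphoreSlim throttler, TimeSpan window)
    {
        try
        {
            // Hold the slot until both the work and the window have elapsed
            await Task.WhenAll(work(item), Task.Delay(window));
        }
        finally
        {
            throttler.Release();
        }
    }
}
```

With a method that takes a URL and returns a Task, a call might look like await ThrottledRunner.RunAsync(urls, ProcessUrl, 5, TimeSpan.FromSeconds(1)). Unlike the group-based loop, a single slow response delays only its own slot rather than the whole next group.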