Performance problem with C# threads/tasks when downloading many web pages

Asked: 2019-02-01 17:17:33

Tags: c# multithreading

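I am running code to download a large number of documents, typically tax bills, from county websites. At first the code seems fast and efficient, and it works well until the file count reaches about 200. That is when performance starts to degrade. If I let it keep running it still works, but it slows to a crawl. I usually have to stop it, work out which files have not been downloaded yet, and start again.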

Any help making this process faster, more efficient, and smoother, regardless of the number of files, would be greatly appreciated.

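I have been convinced that the performance problem is related to writing each result to an html file immediately. I tried storing the results in a StringBuilder until the downloads were finished, but of course I ran out of memory.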

I also tried adjusting MaxDegreeOfParallelism. Lowering it to 5 seemed to make little difference, and the performance problem tied to the file count remains.

    private void Run_Mass_TaxBillDownload()
    {
        string strTag = null;
        string county = countyName.SelectedItem.ToString() + "-";

        //Converting urlList to uriList...
        List<Uri> uriList = new List<Uri>();
        foreach (string url in TextViewer.Lines) // TextViewer is a textbox where the urls to be downloaded are stored...
        {
            if (url.Length > 5){Uri myUri = new Uri(url.Trim(), UriKind.RelativeOrAbsolute);uriList.Add(myUri);}
        }

        Parallel.ForEach(uriList, new ParallelOptions { MaxDegreeOfParallelism = 5 }, str =>
        {
            using (WebClient client = new WebClient())
            {
                //Extracting taxbill numbers from the url to use as file names in the saved file...
                string FirstString = null;
                string LastString = null;
                if (str.ToString().ToLower().Contains("&tptick")) { FirstString = "&TPTICK="; LastString = "&TPSX="; }
                if (str.ToString().ToLower().Contains("&ticket=")) { FirstString = "&ticket="; LastString = "&ticketsuff="; }
                if (str.ToString().ToLower().Contains("demandbilling")) { FirstString = "&ticketNumber="; LastString = "&ticketSuffix="; }

                //Start downloading...
                client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
                client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(clientTaxBill_DownloadStringCompleted);
                client.DownloadStringAsync(str, county + (Between(str.ToString(), FirstString, LastString)));
            }
        });
    }
    private static void clientTaxBill_DownloadStringCompleted(Object sender, DownloadStringCompletedEventArgs e)
    {
        //Creating Output file....
        string deskTopPath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
        string outputPath = deskTopPath + "\\Downloaded Tax Bills";
        string errOutputFile = outputPath + "\\errorReport.txt";
        string results = null;
        string taxBillNum = e.UserState as string;

        try
        {
            File.WriteAllText(outputPath + "\\" + taxBillNum + ".html", e.Result.ToString());
        }
        catch
        {
            results = Environment.NewLine + "<<{ERROR}>> NOTHING FOUND FOR" + taxBillNum;
            File.AppendAllText(errOutputFile, results);
        }
    }
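
The `Between` helper called in the code above is not shown in the question. As a minimal sketch of what such a helper might look like, the version below is purely hypothetical: the class name `UrlText`, the case-insensitive matching, and the choice to return an empty string when a marker is missing are all assumptions. Case-insensitive matching is used because the code above tests for lowercase `&tptick` but then searches for `&TPTICK=`.

```csharp
using System;

public static class UrlText
{
    // Hypothetical helper (not from the question): returns the text found
    // between two markers, or an empty string if either marker is missing.
    public static string Between(string source, string first, string last)
    {
        // FirstString/LastString stay null when no URL pattern matched,
        // so guard against null or empty arguments.
        if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(first) || string.IsNullOrEmpty(last))
            return string.Empty;

        int start = source.IndexOf(first, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return string.Empty;
        start += first.Length;

        // Search for the closing marker only after the opening one.
        int end = source.IndexOf(last, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return string.Empty;

        return source.Substring(start, end - start);
    }
}
```

Returning an empty string rather than throwing keeps a single malformed URL from aborting the whole batch, although the resulting file name would then consist of the county prefix only.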

1 Answer:

Answer 0 (score: 1)

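DownloadStringAsync, if it is doing its job, returns as soon as the download has started, so more than 5 downloads will be running at once: the code hooks up DownloadStringCompleted as a callback, then carries on and loops again.

So it does not wait for each download to complete.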
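
To make the effect concrete, here is a small sketch that measures how many simulated downloads overlap. Everything in it is an assumption made for illustration: `FakeDownloadAsync` stands in for a real download using `Task.Delay`, and the throttled variant uses a `SemaphoreSlim`, which is a different technique from the ActionBlock approach described in this answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrencyDemo
{
    private static int _inFlight;
    private static int _peak;

    // Stand-in for a real download: records the peak number of overlapping calls.
    private static async Task FakeDownloadAsync()
    {
        int now = Interlocked.Increment(ref _inFlight);
        int seen;
        do { seen = Volatile.Read(ref _peak); }
        while (now > seen && Interlocked.CompareExchange(ref _peak, now, seen) != seen);

        await Task.Delay(300);
        Interlocked.Decrement(ref _inFlight);
    }

    private static void Reset() { _inFlight = 0; _peak = 0; }

    // Mirrors the question: the lambda starts the async work and returns at once,
    // so MaxDegreeOfParallelism only limits how fast downloads are *started*.
    public static async Task<int> PeakFireAndForgetAsync(int count)
    {
        Reset();
        var started = new ConcurrentBag<Task>();
        Parallel.ForEach(Enumerable.Range(0, count),
            new ParallelOptions { MaxDegreeOfParallelism = 5 },
            _ => started.Add(FakeDownloadAsync()));
        await Task.WhenAll(started);
        return _peak;
    }

    // Throttled: a semaphore slot is held until each download has really finished.
    public static async Task<int> PeakThrottledAsync(int count)
    {
        Reset();
        using (var gate = new SemaphoreSlim(5))
        {
            var tasks = Enumerable.Range(0, count).Select(async _ =>
            {
                await gate.WaitAsync();
                try { await FakeDownloadAsync(); }
                finally { gate.Release(); }
            }).ToList();
            await Task.WhenAll(tasks);
        }
        return _peak;
    }
}
```

With 20 simulated downloads, the fire-and-forget version would typically report a peak well above 5, while the throttled version never exceeds 5, because a slot is only released once the awaited work completes.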

ActionBlock (from TPL Dataflow) is your friend here, as it works much better with async code, and you can pair it with HttpClient instead of WebClient.

Try something like this:

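    // ActionBlock lives in System.Threading.Tasks.Dataflow.
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    // Class members: one HttpClient is shared by all requests
    // instead of creating a new one per download.
    private static readonly HttpClient httpClient = new HttpClient();

    public static async Task Downloader()
    {
        var urls = new string[] { "https://www.google.co.uk/", "https://www.microsoft.com/" };

        var ab = new ActionBlock<string>(async (url) =>
        {
            using (var httpResponse = await httpClient.GetAsync(url))
            {
                var text = await httpResponse.Content.ReadAsStringAsync();

                // just write it to a file
                Console.WriteLine(text);
            }
        }, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 5 });

        foreach (var url in urls)
        {
            await ab.SendAsync(url);
        }

        ab.Complete();        // signal that no more urls will be sent
        await ab.Completion;  // wait until every queued url has been processed
        Console.WriteLine("Done");
        Console.ReadKey();
    }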

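MaxDegreeOfParallelism = 5 means at most 5 urls are processed at a time. The await ab.SendAsync(url); matters too: if you limit the buffer size with BoundedCapacity = n, SendAsync waits until there is room, whereas ab.Post() does not; it simply returns false when there is no room.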
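
The difference between Post and SendAsync on a bounded block can be sketched as follows. This is an illustrative example under stated assumptions, not part of the original answer: the class name, the `release` gate (a TaskCompletionSource used to keep the first item busy), and the placeholder url strings are all hypothetical, and BoundedCapacity is set to 1 only to make the behaviour easy to observe.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedCapacityDemo
{
    public static async Task<(bool first, bool second, bool sent)> RunAsync()
    {
        // Hypothetical gate: the handler waits on it, so the first item stays "in progress".
        var release = new TaskCompletionSource<bool>();

        var block = new ActionBlock<string>(
            async url => { await release.Task; },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 1 });

        bool first = block.Post("url-1");   // accepted: the block has room
        bool second = block.Post("url-2");  // declined immediately: the block is full

        // SendAsync does not give up; its task completes once the block has room.
        Task<bool> pending = block.SendAsync("url-3");
        release.SetResult(true);            // let the first item finish
        bool sent = await pending;

        block.Complete();
        await block.Completion;
        return (first, second, sent);
    }
}
```

In this sketch `first` and `sent` are expected to be true and `second` false, which is why a loop that feeds a bounded block should await SendAsync rather than ignore the return value of Post.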