How bad is the way I'm managing concurrent tasks in C#?

Asked: 2017-04-11 11:51:19

Tags: c# .net multithreading concurrency task-parallel-library

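I'm writing an application that crawls a website and runs various tests on each page. I want the option of letting the crawler run several scrapes concurrently. I have a semi-working solution: it crawls the site fine with 1 task, and when I raise the maximum number of concurrent tasks to 2 it runs faster, as expected. But when I set it to 3 or higher, it seems to get slower. I come from a PHP background, so I'm fairly sure I'm doing this in the worst possible way.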

var DontLockGuiTask = Task.Run(() =>
{
    while (true)
    {
        if (tokenSource2.IsCancellationRequested)
        {
            Logger.AddToActivityLog("Tasks stopped by user");
            break;
        }

        if (URLsToCheck.Count == 0 && CurrentNumberOfScrapes == 0)
        {
            EndOfCheck = true;
            break;
        }

        lock ("CurrentNumberOfScrapes")
        {
            CurrentNumberOfScrapes++;
        }

        var ScrapeTask = Task.Run(() =>
        {
            if (EndOfCheck)
            {
                CurrentNumberOfScrapes--;
                return;
            }

            URLCheckResultObject CheckResultForURL;

            Checker Checker                 = new Checker();
            URLsToCheckObject URLToCheck    = new URLsToCheckObject();

            lock ("URLsToCheck")
            {
                if (URLsToCheck.Count == 0)
                {
                    lock ("CurrentNumberOfScrapes")
                    {
                        CurrentNumberOfScrapes--;
                        return;
                    }
                }

                URLToCheck = URLsToCheck.First();
                URLsToCheck.Remove(URLToCheck);
            }

            CheckResultForURL = Checker.PerformCheckOnURL(URLToCheck, this);

            PagesCrawledCounter++;
            ChecksPerformedCounter += CheckResultForURL.Checkcounter;

            CheckResultForURL.URLID = PagesCrawledCounter;

            Logger.AddToActivityLog("Checking " + URLToCheck.URLAddress + "....");

            if(CheckResultForURL.NewListOFURLSToCheck != null)
            {
                foreach (LinkObject NewURLToAdd in CheckResultForURL.NewListOFURLSToCheck)
                {
                    lock ("URLsToCheck")
                    {
                        string CleanURL = NewURLToAdd.destinationURL;

                        if (CleanURL.EndsWith("/"))
                        {
                            CleanURL = CleanURL.Substring(0, CleanURL.Length - 1);
                        }

                        if (URlsWeKnownAbout.Contains(CleanURL)) continue;

                        URlsWeKnownAbout.Add(CleanURL);
                        URLsToCheck.Add(new URLsToCheckObject { URLAddress = CleanURL, Host = host });
                    }
                }
            }

            CheckResultForURL.NewListOFURLSToCheck = null;

            if(CheckResultForURL.SocialCheckResult != null)
            {
                ProblemID++;
                CheckResultForURL.SocialCheckResult.URLID = ProblemID;
                InsertSQLProblemIntoDataGrid(CheckResultForURL.SocialCheckResult);
            }

            lock ("CurrentNumberOfScrapes")
            {
                CurrentNumberOfScrapes--;
            }
        });

        while (CurrentNumberOfScrapes >= CurrentNumberScrapesMax)
        {
            if (tokenSource2.IsCancellationRequested == true)
            {
                Logger.AddToActivityLog("Tasks stopped by user");
                break;
            }

            Thread.Sleep(100);
        }
    }

    EnableUsedGUIForRun();

}, tokenSource2.Token); 

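As you can see, I have a couple of while loops that check how many tasks are currently running. The outer loop sleeps when no more tasks are needed and resumes when older tasks finish or the number of running tasks drops below the maximum level.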

How should I handle this? I want to manage multiple concurrent tasks that all access the same variables.

1 Answer:

Answer 0: (score: 1)

If you weren't adding new work items to URLsToCheck, your code could be simplified to:

Parallel.ForEach(URLsToCheck,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, url => CrawlAcross(url));
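
To make the Parallel.ForEach call above easier to try in isolation, here is a minimal self-contained sketch. The class name ParallelCrawlSketch, the CrawlAll helper, and the trivial CrawlAcross body are assumptions for illustration only; the real CrawlAcross would fetch and test the page.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelCrawlSketch
{
    // Hypothetical stand-in for the real per-URL work (CrawlAcross in the answer).
    public static int CrawlAcross(string url) => url.Length;

    // Processes a fixed set of URLs with at most maxParallel running at once.
    public static ConcurrentBag<int> CrawlAll(IEnumerable<string> urls, int maxParallel)
    {
        // ConcurrentBag is safe to add to from several threads without a lock.
        var results = new ConcurrentBag<int>();

        Parallel.ForEach(
            urls,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallel },
            url => results.Add(CrawlAcross(url)));

        return results;
    }
}
```

Parallel.ForEach blocks the calling thread until every item has been processed, so in a GUI application it would still need to run off the UI thread.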

But if you do need to add more URLs to crawl as you go, you need more sophisticated logic. You could try TPL Dataflow here, with a pipeline like this:

Buffer with urls --> Crawl the url processor --> Result saving

The second stage can post additional URLs back to the buffer, so it could look something like this:

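// Requires: using System.Threading.Tasks.Dataflow;

// Stage 1: holds the URLs that are waiting to be crawled.
var buffer = new BufferBlock<string>();

// Stage 2: crawls a URL and feeds any newly discovered URLs back into the buffer.
var processor = new TransformBlock<string, CrawlResult>(url =>
{
    var result = CrawlAcross(url);
    foreach (var additionalUrl in result.AdditionalUrlsToParse)
    {
        buffer.Post(additionalUrl);
    }
    return result;
});

// Stage 3: saves or displays each result.
var handler = new ActionBlock<CrawlResult>(r => HandleResult(r));

buffer.LinkTo(processor, new DataflowLinkOptions() { PropagateCompletion = true });
processor.LinkTo(handler, new DataflowLinkOptions() { PropagateCompletion = true });

// Seed the pipeline with the initial URLs.
foreach (var url in URLsToCheck)
{
    buffer.Post(url);
}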
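
The pipeline above never calls Complete on the buffer, so nothing signals when the crawl has finished. One possible way to handle that, sketched below, is to count outstanding URLs and complete the buffer when the count reaches zero. Everything beyond the block types is an assumption for illustration: the DataflowCrawlSketch class, the shape of CrawlResult, the seen set used for de-duplication, the pending counter, and passing the crawl function in as a delegate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

public static class DataflowCrawlSketch
{
    // Hypothetical result type; the answer only names CrawlResult and AdditionalUrlsToParse.
    public sealed class CrawlResult
    {
        public string Url = "";
        public List<string> AdditionalUrlsToParse = new List<string>();
    }

    public static async Task<List<string>> CrawlAsync(
        IEnumerable<string> seeds, Func<string, CrawlResult> crawlAcross, int maxParallel)
    {
        var seen = new ConcurrentDictionary<string, bool>(); // URLs already queued
        var handled = new ConcurrentQueue<string>();         // URLs whose result was handled
        int pending = 1; // outstanding URLs; starts at 1 as a guard while seeding

        var buffer = new BufferBlock<string>();

        // Queues a URL once and counts it as outstanding work.
        void Enqueue(string url)
        {
            if (seen.TryAdd(url, true))
            {
                Interlocked.Increment(ref pending);
                buffer.Post(url);
            }
        }

        var processor = new TransformBlock<string, CrawlResult>(url =>
        {
            var result = crawlAcross(url);
            foreach (var extra in result.AdditionalUrlsToParse) Enqueue(extra);
            // This URL is finished; if nothing else is outstanding, close the pipeline.
            if (Interlocked.Decrement(ref pending) == 0) buffer.Complete();
            return result;
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallel });

        var handler = new ActionBlock<CrawlResult>(r => handled.Enqueue(r.Url));

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        buffer.LinkTo(processor, link);
        processor.LinkTo(handler, link);

        foreach (var url in seeds) Enqueue(url);
        // Release the seeding guard; this also completes the buffer for an empty seed list.
        if (Interlocked.Decrement(ref pending) == 0) buffer.Complete();

        await handler.Completion;
        return new List<string>(handled);
    }
}
```

The counter starts at 1 so that a fast first URL cannot drive it to zero while the remaining seeds are still being posted. MaxDegreeOfParallelism on the TransformBlock plays the role of the maximum number of concurrent scrapes from the question.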

Side notes about your code:

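  • The while loop inside the task should be marked as LongRunning
  • The child task should be moved out
  • The lock statements should not lock on string constants but on a dedicated static object, for readability and predictable results
  • if (tokenSource2.IsCancellationRequested == true) can be simplified to if (tokenSource2.IsCancellationRequested)
  • If you are checking the tokenSource2.IsCancellationRequested flag, you should call ThrowIfCancellationRequested instead
  • You should work with the token rather than the tokenSource
  • Maybe something else, it's hard to say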
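
A few of the notes above can be sketched together: locking on a dedicated object, starting the coordinating loop as a LongRunning task, and working with the CancellationToken directly. The UrlQueue and CoordinatorSketch names are hypothetical, the per-URL work is reduced to a counter, and the lock object is a private instance field here rather than a static one because the queue is wrapped in its own class.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class UrlQueue
{
    // Dedicated private lock object instead of lock ("URLsToCheck").
    private readonly object _sync = new object();
    private readonly List<string> _urls = new List<string>();

    public void Add(string url)
    {
        lock (_sync) { _urls.Add(url); }
    }

    // Removes and returns the first URL, or returns false when the queue is empty.
    public bool TryTake(out string url)
    {
        lock (_sync)
        {
            if (_urls.Count == 0) { url = null; return false; }
            url = _urls[0];
            _urls.RemoveAt(0);
            return true;
        }
    }
}

public static class CoordinatorSketch
{
    // Drains the queue on a LongRunning task and returns how many URLs were taken.
    public static Task<int> RunAsync(UrlQueue queue, CancellationToken token)
    {
        return Task.Factory.StartNew(() =>
        {
            int processed = 0;
            while (queue.TryTake(out string url))
            {
                // Throws OperationCanceledException, so the task ends up Canceled
                // instead of silently breaking out of the loop.
                token.ThrowIfCancellationRequested();
                processed++; // the real per-URL check would go here
            }
            return processed;
        }, token, TaskCreationOptions.LongRunning, TaskScheduler.Default);
    }
}
```

The caller keeps the CancellationTokenSource and hands only its Token to the worker, which is one reading of the advice to work with the token rather than the tokenSource.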