I'm writing an application that crawls a website and runs various checks on each page. I want the crawler to be able to fetch several pages concurrently, with a configurable maximum number of simultaneous tasks. I have a semi-working solution: with a maximum of 1 concurrent task it crawls the site fine, and with a maximum of 2 it runs faster, as expected. But when I set it to 3 or higher, it actually seems to get slower. I come from a PHP background, so I'm fairly sure I'm going about this in the worst possible way.
var DontLockGuiTask = Task.Run(() =>
{
    while (true)
    {
        if (tokenSource2.IsCancellationRequested)
        {
            Logger.AddToActivityLog("Tasks stopped by user");
            break;
        }
        if (URLsToCheck.Count == 0 && CurrentNumberOfScrapes == 0)
        {
            EndOfCheck = true;
            break;
        }
        lock ("CurrentNumberOfScrapes")
        {
            CurrentNumberOfScrapes++;
        }
        var ScrapeTask = Task.Run(() =>
        {
            if (EndOfCheck)
            {
                CurrentNumberOfScrapes--;
                return;
            }
            URLCheckResultObject CheckResultForURL;
            Checker Checker = new Checker();
            URLsToCheckObject URLToCheck = new URLsToCheckObject();
            lock ("URLsToCheck")
            {
                if (URLsToCheck.Count == 0)
                {
                    lock ("CurrentNumberOfScrapes")
                    {
                        CurrentNumberOfScrapes--;
                        return;
                    }
                }
                URLToCheck = URLsToCheck.First();
                URLsToCheck.Remove(URLToCheck);
            }
            CheckResultForURL = Checker.PerformCheckOnURL(URLToCheck, this);
            PagesCrawledCounter++;
            ChecksPerformedCounter += CheckResultForURL.Checkcounter;
            CheckResultForURL.URLID = PagesCrawledCounter;
            Logger.AddToActivityLog("Checking " + URLToCheck.URLAddress + "....");
            if (CheckResultForURL.NewListOFURLSToCheck != null)
            {
                foreach (LinkObject NewURLToAdd in CheckResultForURL.NewListOFURLSToCheck)
                {
                    lock ("URLsToCheck")
                    {
                        string CleanURL = NewURLToAdd.destinationURL;
                        if (CleanURL.EndsWith("/"))
                        {
                            CleanURL = CleanURL.Substring(0, CleanURL.Length - 1);
                        }
                        if (URlsWeKnownAbout.Contains(CleanURL)) continue;
                        URlsWeKnownAbout.Add(CleanURL);
                        URLsToCheck.Add(new URLsToCheckObject { URLAddress = CleanURL, Host = host });
                    }
                }
            }
            CheckResultForURL.NewListOFURLSToCheck = null;
            if (CheckResultForURL.SocialCheckResult != null)
            {
                ProblemID++;
                CheckResultForURL.SocialCheckResult.URLID = ProblemID;
                InsertSQLProblemIntoDataGrid(CheckResultForURL.SocialCheckResult);
            }
            lock ("CurrentNumberOfScrapes")
            {
                CurrentNumberOfScrapes--;
            }
        });
        while (CurrentNumberOfScrapes >= CurrentNumberScrapesMax)
        {
            if (tokenSource2.IsCancellationRequested == true)
            {
                Logger.AddToActivityLog("Tasks stopped by user");
                break;
            }
            Thread.Sleep(100);
        }
    }
    EnableUsedGUIForRun();
}, tokenSource2.Token);
As you can see, I have a couple of while loops that check how many tasks are currently running and sleep when no more are needed, only spawning new tasks once older ones have finished and the number of running tasks drops below the CurrentNumberScrapesMax level.

How should I be doing this? I want to manage multiple concurrent tasks that all access the same variables.
Answer (score: 1)
If you don't add new urls to URLsToCheck during the crawl, your code could be simplified to:
Parallel.ForEach(URLsToCheck,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    url => CrawlAcross(url));
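Note that Parallel.ForEach blocks the calling thread until the whole collection has been processed, so you would still start it from a background task rather than from the GUI thread; MaxDegreeOfParallelism simply caps how many urls are crawled at the same time.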
But if you do need to discover and crawl additional urls along the way, you need more complex logic. You can try TPL Dataflow here, with a pipeline like this:
Buffer with urls --> Crawl the url processor --> Result saving
where the second block can post the additional urls it discovers back into the buffer. So it could be something like this:
var buffer = new BufferBlock<string>();
var processor = new TransformBlock<string, CrawlResult>(url =>
{
    var result = CrawlAcross(url);
    foreach (var additionalUrl in result.AdditionalUrlsToParse)
    {
        buffer.Post(additionalUrl);
    }
    return result;
});
var handler = new ActionBlock<CrawlResult>(r => HandleResult(r));

buffer.LinkTo(processor, new DataflowLinkOptions() { PropagateCompletion = true });
processor.LinkTo(handler, new DataflowLinkOptions() { PropagateCompletion = true });

foreach (var url in URLsToCheck)
{
    buffer.Post(url);
}
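One detail the snippet above leaves open is completion: because the processor posts newly discovered urls back into the buffer, you cannot simply call buffer.Complete() right after the initial Post loop, or late discoveries would be rejected. Below is a minimal, self-contained sketch of one way to handle that with a counter of in-flight urls; the CrawlResult type, the pending counter, seedUrls and the placeholder CrawlAcross are illustrative names rather than part of the code above, and the blocks require the System.Threading.Tasks.Dataflow package:

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks.Dataflow;

// Hypothetical result type standing in for whatever CrawlAcross returns.
class CrawlResult
{
    public string Url;
    public List<string> AdditionalUrlsToParse = new List<string>();
}

class CrawlerPipeline
{
    // Urls posted to the buffer but not yet fully processed.
    static int pending;

    static void Main()
    {
        var seedUrls = new[] { "http://example.com" };
        var buffer = new BufferBlock<string>();

        var processor = new TransformBlock<string, CrawlResult>(url =>
        {
            var result = CrawlAcross(url);
            foreach (var additionalUrl in result.AdditionalUrlsToParse)
            {
                Interlocked.Increment(ref pending);
                buffer.Post(additionalUrl);
            }
            // This url is finished; once nothing is left in flight,
            // completing the buffer lets completion propagate downstream.
            if (Interlocked.Decrement(ref pending) == 0)
            {
                buffer.Complete();
            }
            return result;
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

        var handler = new ActionBlock<CrawlResult>(r => Console.WriteLine("Handled " + r.Url));

        buffer.LinkTo(processor, new DataflowLinkOptions { PropagateCompletion = true });
        processor.LinkTo(handler, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var url in seedUrls)
        {
            Interlocked.Increment(ref pending);
            buffer.Post(url);
        }

        // Blocks until every url, including ones discovered along the way,
        // has flowed through the whole pipeline.
        handler.Completion.Wait();
    }

    // Placeholder crawl that discovers no further links.
    static CrawlResult CrawlAcross(string url)
    {
        return new CrawlResult { Url = url };
    }
}

The key point is that the buffer may only be completed once no url is still in flight anywhere in the pipeline; an atomic counter, incremented before each Post and decremented after each url finishes, is one simple way to detect that moment.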
A few side notes about your code (see the sketch after this list):

- the task running the outer while loop should be marked as LongRunning;
- lock statements should not be taken on string constants, but on a dedicated static object, for readability and predictable results;
- if (tokenSource2.IsCancellationRequested == true) can be simplified to if (tokenSource2.IsCancellationRequested);
- instead of checking the tokenSource2.IsCancellationRequested flag, you should call ThrowIfCancellationRequested.
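A minimal sketch of how those notes could look when applied to the outer loop, assuming tokenSource2 and CurrentNumberOfScrapes from the question; the ScrapeCountLock field name is illustrative:

// Class-level field: a dedicated lock object instead of lock("string").
// Locking on string literals is risky because identical literals can be
// interned to the same object, so unrelated code may contend for the lock.
private static readonly object ScrapeCountLock = new object();

// Inside the method. Task.Factory.StartNew is used because Task.Run
// has no overload that accepts TaskCreationOptions.LongRunning.
var dontLockGuiTask = Task.Factory.StartNew(() =>
{
    while (true)
    {
        // Throws OperationCanceledException and moves the task to the
        // Canceled state, instead of checking the flag by hand.
        tokenSource2.Token.ThrowIfCancellationRequested();

        lock (ScrapeCountLock)
        {
            CurrentNumberOfScrapes++;
        }
        // ... rest of the loop body ...
    }
}, tokenSource2.Token, TaskCreationOptions.LongRunning, TaskScheduler.Default);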