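I am trying to crawl multiple URLs concurrently. Each request can add more URLs to a ConcurrentBag to be crawled. At the moment I have a nasty while (true) loop that starts a new Parallel.ForEach to process any new URLs.
Is there a way to add to the contents of the ConcurrentBag so that the running Parallel.ForEach sees the new items and keeps iterating over them?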
ConcurrentBag<LinkObject> URLSToCheck = new ConcurrentBag<LinkObject>();
while (true)
{
    Parallel.ForEach(URLSToCheck, new ParallelOptions { MaxDegreeOfParallelism = 5 }, URL =>
    {
        Checker Checker = new Checker();
        URLDownloadResult result = Checker.downloadFullURL(URL.destinationURL);
        List<LinkObject> URLsToAdd = Checker.findInternalUrls(URL.sourceURL, result.html);
        foreach (var URLToAdd in URLsToAdd)
        {
            URLSToCheck.Add(new LinkObject { sourceURL = URLToAdd.sourceURL, destinationURL = URLToAdd.destinationURL });
        }
    });
    if (URLSToCheck.Count == 0) break;
}
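Answer 0 (score: 4)
You could take a look at BlockingCollection.
BlockingCollection provides an implementation of the producer/consumer pattern: your producers add to the blocking collection, and your Parallel.ForEach consumes from it.
To do that, you have to implement a custom partitioner for the BlockingCollection (the reason is explained here: https://blogs.msdn.microsoft.com/pfxteam/2010/04/06/parallelextensionsextras-tour-4-blockingcollectionextensions/)
The partitioner: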
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

class BlockingCollectionPartitioner<T> : Partitioner<T>
{
    private BlockingCollection<T> _collection;

    internal BlockingCollectionPartitioner(BlockingCollection<T> collection)
    {
        if (collection == null)
            throw new ArgumentNullException("collection");
        _collection = collection;
    }

    // Workers pull items on demand instead of receiving a fixed share up front.
    public override bool SupportsDynamicPartitions
    {
        get { return true; }
    }

    public override IList<IEnumerator<T>> GetPartitions(int partitionCount)
    {
        if (partitionCount < 1)
            throw new ArgumentOutOfRangeException("partitionCount");
        var dynamicPartitioner = GetDynamicPartitions();
        return Enumerable.Range(0, partitionCount).Select(_ => dynamicPartitioner.GetEnumerator()).ToArray();
    }

    // Each enumerator removes items from the collection as it yields them
    // and blocks while the collection is empty but not yet completed.
    public override IEnumerable<T> GetDynamicPartitions()
    {
        return _collection.GetConsumingEnumerable();
    }
}
You would then use it like this:
BlockingCollection<LinkObject> URLSToCheck = new BlockingCollection<LinkObject>();
Parallel.ForEach(
    new BlockingCollectionPartitioner<LinkObject>(URLSToCheck),
    new ParallelOptions { MaxDegreeOfParallelism = 5 }, URL =>
    {
        //....
    });
On another thread, you add to the URLSToCheck collection:
URLSToCheck.Add(...)
When there are no more URLs to process, call URLSToCheck.CompleteAdding() and the Parallel.ForEach should stop automatically.
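When the loop body itself is the producer, as in the question, something still has to decide when to call CompleteAdding(). The following is a minimal sketch of one way to do that, not a definitive implementation: it keeps a counter of items that are queued or in progress and completes the collection when the counter reaches zero. Several things here are assumptions that are not in the original posts: the FakeLinks dictionary stands in for the real download and link extraction, URLs are plain strings rather than LinkObject, the seen set is added so that cyclic links are visited only once, and the built-in Partitioner.Create(..., EnumerablePartitionerOptions.NoBuffering) is substituted for the custom partitioner above purely to keep the sketch self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BlockingCrawlSketch
{
    // Hypothetical in-memory link graph standing in for real downloads.
    private static readonly Dictionary<string, string[]> FakeLinks =
        new Dictionary<string, string[]>
        {
            ["a"] = new[] { "b", "c" },
            ["b"] = new[] { "c", "d" },
            ["c"] = new[] { "a" },      // cycle back to the start
            ["d"] = new string[0],
        };

    public static List<string> Crawl(IEnumerable<string> seeds)
    {
        var queue = new BlockingCollection<string>();
        var seen = new ConcurrentDictionary<string, byte>();
        int pending = 0; // items that are queued or currently being processed

        // Seeding happens before any worker starts, so no interlocking is needed yet.
        foreach (var seed in seeds)
        {
            if (seen.TryAdd(seed, 0))
            {
                pending++;
                queue.Add(seed);
            }
        }
        if (pending == 0) queue.CompleteAdding();

        // NoBuffering makes each worker take one item at a time from the queue.
        var source = Partitioner.Create(
            queue.GetConsumingEnumerable(),
            EnumerablePartitionerOptions.NoBuffering);

        Parallel.ForEach(source, new ParallelOptions { MaxDegreeOfParallelism = 5 }, url =>
        {
            foreach (var link in FakeLinks[url])
            {
                if (seen.TryAdd(link, 0))
                {
                    // Count the new item before it becomes visible to other workers.
                    Interlocked.Increment(ref pending);
                    queue.Add(link);
                }
            }

            // This item is done; if nothing else is queued or running, end the loop.
            if (Interlocked.Decrement(ref pending) == 0)
                queue.CompleteAdding();
        });

        return seen.Keys.OrderBy(k => k, StringComparer.Ordinal).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Crawl(new[] { "a" })));
    }
}
```

Incrementing the counter before Add and decrementing only after all child links have been queued means the count can reach zero only when no item is waiting and no worker can still produce one, so CompleteAdding() is never followed by another Add.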
Answer 1 (score: 2)
Dataflow comes in handy here. This can be done nicely with an ActionBlock:
// Declare the variable first so the lambda below can capture it
ActionBlock<LinkObject> actionBlock = null;
actionBlock = new ActionBlock<LinkObject>(URL =>
{
    Checker Checker = new Checker();
    URLDownloadResult result = Checker.downloadFullURL(URL.destinationURL);
    List<LinkObject> URLsToAdd = Checker.findInternalUrls(URL.sourceURL, result.html);
    // Post returns bool, so it cannot be passed to ForEach as a method group
    URLsToAdd.ForEach(link => actionBlock.Post(link));
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
Then post your initial URLs to actionBlock:
actionBlock.Post(url1);
actionBlock.Post(url2);
...
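As written, nothing ever tells the block that the crawl has finished. Here is a minimal sketch of one way to detect that, offered under stated assumptions rather than as the answerer's method: a counter tracks outstanding items, the block calls Complete() on itself when the counter reaches zero, and the caller waits on Completion. The FakeLinks dictionary, the string URLs and the seen set are hypothetical stand-ins for the real Checker and LinkObject types, and depending on the target framework the System.Threading.Tasks.Dataflow package may need to be referenced.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks.Dataflow;

public static class DataflowCrawlSketch
{
    // Hypothetical in-memory link graph standing in for real downloads.
    private static readonly Dictionary<string, string[]> FakeLinks =
        new Dictionary<string, string[]>
        {
            ["a"] = new[] { "b", "c" },
            ["b"] = new[] { "c", "d" },
            ["c"] = new[] { "a" },      // cycle back to the start
            ["d"] = new string[0],
        };

    public static List<string> Crawl(IEnumerable<string> seeds)
    {
        var seen = new ConcurrentDictionary<string, byte>();

        // Starts at 1: a guard token held while seeding, so an early seed that
        // finishes quickly cannot drive the count to zero prematurely.
        int pending = 1;

        ActionBlock<string> block = null;
        block = new ActionBlock<string>(url =>
        {
            foreach (var link in FakeLinks[url])
            {
                if (seen.TryAdd(link, 0))
                {
                    // Count the new item before posting it.
                    Interlocked.Increment(ref pending);
                    block.Post(link);
                }
            }

            // This item is done; complete the block if nothing is outstanding.
            if (Interlocked.Decrement(ref pending) == 0)
                block.Complete();
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

        foreach (var seed in seeds)
        {
            if (seen.TryAdd(seed, 0))
            {
                Interlocked.Increment(ref pending);
                block.Post(seed);
            }
        }

        // Release the guard token now that all seeds have been posted.
        if (Interlocked.Decrement(ref pending) == 0)
            block.Complete();

        // Blocks until every posted item has been processed.
        block.Completion.Wait();
        return seen.Keys.OrderBy(k => k, StringComparer.Ordinal).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Crawl(new[] { "a" })));
    }
}
```

In asynchronous code, awaiting block.Completion would be the more natural choice than calling Wait().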