I'm writing a web crawler that scrapes specific URLs and adds them to a list.
using HtmlAgilityPack;

List<string> mylist = new List<string>();
var firstUrl = "http://example.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(firstUrl);
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
{
    if (!mylist.Contains(htmlNode.InnerText))
    {
        mylist.Add(htmlNode.InnerText);
    }
}
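What I want to do now is iterate over "mylist" and do exactly the same thing, essentially continuing forever. The code should take the newly parsed URLs and add them to the list. What is the simplest way to do this?
I tried adding a for loop after the code above, but it doesn't seem to update the list. It just keeps looping forever over the same items already in the list (because i will always be less than mylist.Count).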
for (int i = 0; i < mylist.Count; i++)
{
    //the items in mylist are added to the url
    var urls = "http://example.com" + mylist[i];
    HtmlWeb web = new HtmlWeb();
    HtmlDocument document = web.Load(urls);
    HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
    foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
    {
        if (!mylist.Contains(htmlNode.InnerText))
        {
            mylist.Add(htmlNode.InnerText);
        }
    }
}
Thanks!
Answer 0: (score: 1)
A Queue fits your requirement.
Queue<string> mylist = new Queue<string>();
First pass:
using HtmlAgilityPack;

Queue<string> mylist = new Queue<string>();
var firstUrl = "http://example.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(firstUrl);
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
{
    if (!mylist.Contains(htmlNode.InnerText))
    {
        mylist.Enqueue(htmlNode.InnerText);
    }
}
Now the second pass:
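// Queue.Contains only sees items still waiting in the queue, so remember
// everything that was ever enqueued to avoid re-queuing processed links.
var seen = new HashSet<string>(mylist);
while (mylist.Count > 0)
{
    var url = mylist.Dequeue();
    //the items in mylist are added to the url
    var urls = "http://example.com" + url;
    // Reuse web, document and nodes from the first pass; redeclaring them
    // inside the loop would not compile in the same method.
    document = web.Load(urls);
    nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
    if (nodes == null) continue; // SelectNodes returns null when nothing matches
    foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
    {
        if (seen.Add(htmlNode.InnerText))
        {
            mylist.Enqueue(htmlNode.InnerText);
        }
    }
}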
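The queue-driven loop above can also be sketched in a self-contained form that runs without network access. This is only an illustration under assumptions: CrawlSketch, Crawl, getLinks and maxPages are made-up names, getLinks stands in for the HtmlWeb.Load and SelectNodes step, and the link graph is invented. The sketch also tracks every link ever queued in a HashSet, because Queue.Contains only checks items that are still waiting.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Breadth-first crawl. 'getLinks' is a hypothetical stand-in for loading
    // a page and extracting its links; 'maxPages' is an assumed safety cap.
    public static List<string> Crawl(string start, Func<string, IEnumerable<string>> getLinks, int maxPages)
    {
        var pending = new Queue<string>();
        var seen = new HashSet<string>();
        var visited = new List<string>();

        pending.Enqueue(start);
        seen.Add(start);

        while (pending.Count > 0 && visited.Count < maxPages)
        {
            var current = pending.Dequeue();
            visited.Add(current);

            foreach (var link in getLinks(current))
            {
                // HashSet.Add returns false for duplicates, so each link
                // is queued at most once even if the pages form a cycle.
                if (seen.Add(link))
                {
                    pending.Enqueue(link);
                }
            }
        }
        return visited;
    }

    public static void Main()
    {
        // Invented link graph with a cycle (/a -> /b -> /a), for illustration only.
        var graph = new Dictionary<string, string[]>
        {
            ["/"] = new[] { "/a", "/b" },
            ["/a"] = new[] { "/b" },
            ["/b"] = new[] { "/a", "/c" },
            ["/c"] = new string[0],
        };

        var order = Crawl("/", page => graph[page], 100);
        Console.WriteLine(string.Join(", ", order)); // prints "/, /a, /b, /c"
    }
}
```

Because the loop stops once the queue is empty or the cap is reached, it terminates even on cyclic sites; dropping the cap would give the "continue forever" behaviour only on a site that keeps producing new links.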
Answer 1: (score: 0)
One possible (dangerous?) implementation using recursion, which generates the URLs as they are consumed:
public IEnumerable<string> Scrap(string url)
{
    var web = new HtmlWeb();
    var seenUrls = new HashSet<string>();
    return ScrapImpl(web, seenUrls, url);
}

private IEnumerable<string> ScrapImpl(HtmlWeb web, HashSet<string> seenUrls, string baseUrl)
{
    var document = web.Load(baseUrl);
    // SelectNodes returns null when no node matches, so guard before iterating.
    var nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
    if (nodes == null)
    {
        yield break;
    }
    foreach (var node in nodes)
    {
        yield return node.InnerText;
        // Recurse only into links that have not been seen before.
        if (seenUrls.Add(node.InnerText))
        {
            foreach (var childUrl in ScrapImpl(web, seenUrls, baseUrl + node.InnerText))
            {
                yield return childUrl;
            }
        }
    }
}
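To illustrate what "generated as they are consumed" means for a recursive iterator, here is a small self-contained sketch. It is a variant rather than the code above: it checks the seen set before yielding, so no link is returned twice. LazyCrawlSketch, Walk and the Links graph are hypothetical names and data used only for this example, with an in-memory dictionary in place of real pages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyCrawlSketch
{
    // Invented in-memory link graph standing in for real pages; it contains a cycle.
    private static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        ["/"] = new[] { "/a", "/b" },
        ["/a"] = new[] { "/b" },
        ["/b"] = new[] { "/a" },
    };

    public static IEnumerable<string> Walk(string page, HashSet<string> seen)
    {
        foreach (var link in Links[page])
        {
            // Skip links already seen so the cycle cannot recurse forever.
            if (!seen.Add(link)) continue;
            yield return link;
            foreach (var child in Walk(link, seen))
            {
                yield return child;
            }
        }
    }

    public static void Main()
    {
        // Nothing is traversed until the sequence is enumerated;
        // Take(1) stops the walk after the first link is produced.
        var first = Walk("/", new HashSet<string>()).Take(1).ToList();
        var all = Walk("/", new HashSet<string>()).ToList();
        Console.WriteLine(string.Join(", ", first)); // prints "/a"
        Console.WriteLine(string.Join(", ", all));   // prints "/a, /b"
    }
}
```

The caller controls how far the crawl goes, for example with Take. The likely danger with this style is depth: every level of recursion adds another nested iterator, so a long chain of links makes each yielded item more expensive and can eventually exhaust the stack.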
Answer 2: (score: 0)
Get "System.Interactive" from NuGet, then do the following:
var found = new HashSet<string>();
var urls =
    EnumerableEx
        .Expand(
            new[] { "http://example.com" },
            url =>
            {
                var web = new HtmlWeb();
                var document = web.Load(url);
                var nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
                return
                    nodes
                        .Cast<HtmlNode>()
                        .Select(x => x.InnerText)
                        .Where(x => !found.Contains(x))
                        .Do(x => found.Add(x))
                        .Select(x => "http://example.com" + x);
            });