I am using HtmlAgilityPack to scrape certain parts of a web page. I get the expected output, but not every time.
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4";
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
var resultPriceTable = doc.DocumentNode.SelectNodes("//div[@class='resultsset']//table");
resultPriceTable comes back null in some cases (close to 50% of the time). From debugging, I found that
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
is causing the problem. It sometimes fails to load the URL. How can I fix this?
Thanks in advance.
Answer 0 (score: 0)
Try loading your page with WebClient or HttpWebRequest / HttpWebResponse, then pass the result to HtmlAgilityPack.
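This code sample retries if you get an empty string or a WebException.
Don't simply swallow the exception; you need to handle it carefully (or at least log it).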
Sample:
using System.Net; // WebClient, WebException, HttpRequestHeader

string url = "http://google.com/";
string html = string.Empty;
int tries = 5;

// Retry until a non-empty page is downloaded or the attempts run out.
while (tries > 0)
{
    tries--;
    using (var client = new WebClient())
    {
        client.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4");
        try
        {
            html = client.DownloadString(url);
            if (!string.IsNullOrEmpty(html))
            {
                break;
            }
        }
        catch (WebException)
        {
            // Don't ignore this silently in real code; at least log it.
        }
    }
}

// html is still empty here if every attempt failed.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);