So I've run into a situation where I'm using HtmlAgilityPack to load web pages so I can scrape their document content. I have a number of URLs to load, and some of them are served gzip-encoded, so I catch the exception that HtmlWeb.Load() throws, check whether it's the gzip-encoding problem, and if so load the page with an HttpWebRequest instead. However, while the first HttpWebRequest succeeds, every subsequent HttpWebRequest attempt times out.
Here's a cleaned-up version of the code:
HtmlDocument doc = new HtmlDocument();
HtmlWeb web = new HtmlWeb();
try
{
    doc = web.Load(uri);
    Console.WriteLine("htmlweb and htmldocument success");
}
catch (ArgumentException ae)
{
    Console.WriteLine("htmlweb and htmldocument not successful");
    if (ae.Message.Contains("'gzip'"))
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
        try
        {
            req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
            req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            req.Method = "GET";
            //req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)";
            string source;
            req.KeepAlive = false;
            //req.Timeout = 100000;
            // On the second iteration we never get beyond this line
            using (WebResponse webResponse = req.GetResponse())
            {
                using (HttpWebResponse httpWebResponse = webResponse as HttpWebResponse)
                {
                    using (StreamReader reader = new StreamReader(httpWebResponse.GetResponseStream()))
                    {
                        source = reader.ReadToEnd();
                    }
                }
            }
            req.Abort();
            Console.WriteLine("httpwebresponse successful");
        }
        catch (WebException we)
        {
            Console.WriteLine("httpwebresponse not successful");
        }
    }
}
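For context, that snippet runs once per URL in a loop, roughly like this (a simplified sketch; urls and LoadPage are stand-ins for my actual list and wrapper method):
// Simplified sketch of how the snippet above is invoked.
// `urls` and LoadPage are hypothetical names; LoadPage contains the try/catch shown above.
foreach (string uri in urls)
{
    // The first URL loads fine; on the second call GetResponse() hangs until it times out.
    LoadPage(uri);
}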
Is there some cleanup I need to do, or is there something I'm forgetting?
Any help is greatly appreciated.
Answer 0 (score: 0)
I think I have to load the page through a WebRequest first, rather than through HtmlWeb. Then I check the response headers for gzip and decompress each response as needed.
// Requires: using System.Net; using System.IO; using System.IO.Compression;
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
//req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
//req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
//req.Method = "GET";
string source = String.Empty;
try
{
    using (WebResponse webResponse = req.GetResponse())
    {
        using (HttpWebResponse httpWebResponse = webResponse as HttpWebResponse)
        {
            // Pick a reader based on the Content-Encoding header and decompress manually if needed
            StreamReader reader;
            if (httpWebResponse.ContentEncoding.ToLower().Contains("gzip"))
            {
                reader = new StreamReader(new GZipStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
            }
            else if (httpWebResponse.ContentEncoding.ToLower().Contains("deflate"))
            {
                reader = new StreamReader(new DeflateStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
            }
            else
            {
                reader = new StreamReader(httpWebResponse.GetResponseStream());
            }
            using (reader)
            {
                source = reader.ReadToEnd();
            }
        }
    }
    req.Abort();
}
catch (Exception ex)
{
    // Received a 404 error - apparently one of my links is now dead...
}
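Once the page source has been downloaded and decompressed this way, it can be handed back to HtmlAgilityPack for the actual scraping. A minimal sketch (the XPath expression is just a placeholder for whatever you are actually extracting):
// Parse the downloaded string with HtmlAgilityPack instead of letting HtmlWeb fetch it.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);

// Placeholder query - replace the XPath with whatever content you actually need.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine(node.GetAttributeValue("href", String.Empty));
    }
}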