Too many redirections were attempted using WebRequest

Time: 2015-12-22 16:38:17

Tags: c# web-scraping webrequest

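When I try to scrape the HTML of a web page, I occasionally get a "too many redirections were attempted" exception.

One example of such a site is http://www.magicshineuk.co.uk/

I normally set the timeout to 6 seconds, but even with 30 seconds and MaximumAutomaticRedirections set to something crazy like 200, the request still either throws the "too many redirections" exception or times out.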

How can I fix this?

My code is below:

    // Requires: using System; using System.Collections.Generic; using System.Net;
    try
    {
        System.Net.WebRequest request = System.Net.WebRequest.Create("http://www.magicshineuk.co.uk/");

        var hwr = (HttpWebRequest)request;

        hwr.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0";
        hwr.Headers.Add("Accept-Language", "en-US,en;q=0.5");
        hwr.Headers.Add("Accept-Encoding", "gzip, deflate");

        // This value is an Accept header, so it is set on Accept rather than ContentType
        // (ContentType describes a request body, which a GET does not have).
        hwr.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        hwr.KeepAlive = true;
        hwr.Timeout = 30000;   // 30 seconds; normally set to 6000
        hwr.Method = "GET";
        hwr.AllowAutoRedirect = true;
        hwr.CookieContainer = new System.Net.CookieContainer();

        // Setting this makes no difference. Normally I would like to keep to a sensible maximum,
        // but I will leave it at the default of 50 if need be.
        // Either way, the "too many redirections" exception occurs.
        hwr.MaximumAutomaticRedirections = 200;

        using (var response = (HttpWebResponse)hwr.GetResponse())
        {
            Console.WriteLine("{0} {1}", (int)response.StatusCode, response.StatusCode);
            Console.WriteLine(response.ResponseUri);
            Console.WriteLine("Last modified: {0}", response.LastModified);
            Console.WriteLine("Server: {0}", response.Server);
            Console.WriteLine("Supports Headers: {0}", response.SupportsHeaders);
            Console.WriteLine("Headers: ");

            // Do something with the response, e.g. collect the headers by name.
            var hc = new Dictionary<string, string>();
            foreach (string hname in response.Headers.AllKeys)
            {
                hc[hname] = response.Headers[hname];
            }
            foreach (var di in hc)
            {
                Console.WriteLine("  {0} = {1}", di.Key, di.Value);
            }
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine("Exception: ");
        Console.WriteLine(ex.Message);
    }

1 answer:

Answer 0 (score: 2)

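I tried your code. I had to comment out // hwr.Host = Utils.GetSimpleUrl(url); and then it ran fine. If you are polling frequently, the target site or something in between (a proxy, a firewall, etc.) may be identifying your polling as a denial-of-service attack and timing you out for a while. Alternatively, if you are behind a corporate firewall, an internal network device may be treating your requests the same way.

Are you running this scraper frequently?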
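If throttling is the cause, spacing out retries may help. The following is a minimal sketch rather than anything taken from the question's code: `PoliteFetcher`, `BackoffDelay`, and `WithRetry` are hypothetical names, and the doubling delay is just one common choice. It retries only on `WebException`, which is the type `HttpWebRequest` throws for both the redirect limit and timeouts.

```csharp
using System;
using System.Net;
using System.Threading;

// Hypothetical helper, not part of the original code.
public static class PoliteFetcher
{
    // Delay before retry number `attempt` (0-based): baseDelay * 2^attempt, capped at maxDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    // Runs `fetch`, retrying on WebException until maxAttempts calls have been made.
    // The last failure is rethrown to the caller.
    public static T WithRetry<T>(Func<T> fetch, int maxAttempts, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return fetch();
            }
            catch (WebException ex)
            {
                if (attempt + 1 >= maxAttempts)
                {
                    throw;
                }
                TimeSpan wait = BackoffDelay(attempt, baseDelay, maxDelay);
                Console.WriteLine("Attempt {0} failed ({1}); waiting {2}", attempt + 1, ex.Status, wait);
                Thread.Sleep(wait);
            }
        }
    }
}
```

The existing `GetResponse()` logic could be wrapped in a lambda and passed as `fetch`, for example with 3 attempts, a 2-second base delay, and a 30-second cap. Those numbers are illustrative; a sensible polling interval depends on the target site.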

Edited to add:

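  • I tried this with .NET 4.5.2, Windows 7 x64, Visual Studio 2015
  • The target site may also be unreliable (going up and down)
  • There may be intermittent network problems between you and the target site
  • They may expose an API, which would be a more reliable integration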
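
To narrow down which of these causes applies (a genuine redirect loop versus throttling or an unreliable site), one option is to turn off automatic redirects and log every hop yourself. This is only a diagnostic sketch under assumptions: `RedirectTracer` and its methods are hypothetical names, and it relies on `HttpWebRequest` returning 3xx responses without throwing when `AllowAutoRedirect` is false.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical diagnostic helper, not part of the original code.
public static class RedirectTracer
{
    // True for any 3xx status code.
    public static bool IsRedirect(int statusCode)
    {
        return statusCode >= 300 && statusCode < 400;
    }

    // Resolves a Location header value (absolute or relative) against the current URL.
    public static Uri Next(Uri current, string location)
    {
        return new Uri(current, location);
    }

    // Follows redirects by hand and returns one log line per hop, up to maxHops.
    public static List<string> Trace(string startUrl, int maxHops)
    {
        var hops = new List<string>();
        var cookies = new CookieContainer();   // shared so cookies set on one hop are sent on the next
        var current = new Uri(startUrl);

        for (int hop = 0; hop < maxHops; hop++)
        {
            var request = (HttpWebRequest)WebRequest.Create(current);
            request.AllowAutoRedirect = false;  // inspect each 3xx response ourselves
            request.CookieContainer = cookies;
            request.Timeout = 30000;

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                int code = (int)response.StatusCode;
                string location = response.Headers["Location"];
                hops.Add(string.Format("{0} {1} -> {2} (Set-Cookie: {3})",
                    code, current, location ?? "(none)",
                    response.Headers["Set-Cookie"] ?? "(none)"));

                if (!IsRedirect(code) || string.IsNullOrEmpty(location))
                {
                    break;
                }
                current = Next(current, location);
            }
        }
        return hops;
    }
}
```

Calling `RedirectTracer.Trace("http://www.magicshineuk.co.uk/", 10)` and printing the result shows whether the same URL keeps reappearing and whether each hop sets a cookie. A site that redirects to itself until a cookie comes back would show up as a repeating pattern here.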