Unable to fetch the HTML of a page

Time: 2012-04-15 17:15:46

Tags: c# .net exception-handling httpwebrequest web-scraping

I want to use HttpWebRequest to fetch the HTML of the following page:

http://inkdispatch.com/brother

Currently I am using:

    public static string getHTML(string url)
    {
        string responseData = "";
        try
        {
            //    System.Threading.Thread.Sleep(1000 * 1);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Accept = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
            request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)";
            request.Timeout = 60000;
            request.AllowAutoRedirect = false;
            request.Method = "GET";
            request.Referer = "inkdispatch.com";
            request.CookieContainer = yummycookies;
            request.KeepAlive = true;

            HttpWebResponse response = (HttpWebResponse)request.GetResponse();
            if (response.StatusCode == HttpStatusCode.OK)
            {
                Stream responseStream = response.GetResponseStream();
                StreamReader myStreamReader = new StreamReader(responseStream);
                responseData = myStreamReader.ReadToEnd();
            }
            foreach (Cookie cook in response.Cookies)
            {
                yummycookies.Add(cook);
            }
            response.Close();
        }
        catch (Exception e)
        {
            responseData = "An error occurred: " + e.Message;
        }

        return responseData;

    }

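But I don't get the page. The response comes back without any error and just says "Moved Permanently", whereas the same link works when I open it in a browser. The link has a token attached to it, and I did get that token from the main page, but it still doesn't work. Any help would be appreciated.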

  

Update

I have just set:

    request.AllowAutoRedirect = true;

But I get this error:

    Too many automatic redirections were attempted.
   at System.Net.HttpWebRequest.GetResponse()
   at inkdispatchcomScraper.Program.getHTML(String url) 

I have Fiddler open, and it shows the link being hit again and again:

    #   Result  Protocol    Host    URL Body    Caching Content-Type    Process Comments    Custom  
72  301 HTTP    inkdispatch.com /brother?zenid=00810c6a184e63149cdca848c7f02871 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
73  301 HTTP    inkdispatch.com /brother?zenid=32cf6d38541a90658d39785b6cd64fbc 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
74  301 HTTP    inkdispatch.com /brother?zenid=70d0d5eaa10175d74933ba00d47876f8 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
75  301 HTTP    inkdispatch.com /brother?zenid=fa45c256a07a9450274269cfa4a4e64a 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
76  301 HTTP    inkdispatch.com /brother?zenid=1fb7677a7e6ae0ca32a154ebcc42e043 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
77  301 HTTP    inkdispatch.com /brother?zenid=39923f8100276b1c0fa5ccfb1f8d222c 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
78  301 HTTP    inkdispatch.com /brother?zenid=fef228719b375ac012c4755793a0027a 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
79  301 HTTP    inkdispatch.com /brother?zenid=5c2babf5e6b9b0834f605734441ba208 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
80  301 HTTP    inkdispatch.com /brother?zenid=711bdefa3ca7cccebf63b9b8a3734be1 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
81  301 HTTP    inkdispatch.com /brother?zenid=c55d1b6166994be1436c9473a1519abe 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
83  301 HTTP    inkdispatch.com /brother?zenid=cc66424548f23c3c64b2e0054289283f 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
84  301 HTTP    inkdispatch.com /brother?zenid=6f05f06093cd345d10ca729117994ac0 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
85  301 HTTP    inkdispatch.com /brother?zenid=4a2ab4d3824c4850f544f28cd71bc1bb 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
86  301 HTTP    inkdispatch.com /brother?zenid=6c9d0acd69fc22821014c7e3263da7b6 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
87  301 HTTP    inkdispatch.com /brother?zenid=fff05b8df3a1488add36591a2687a830 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
88  301 HTTP    inkdispatch.com /brother?zenid=b10facbe8bc9b9a355fe648649067f98 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
89  301 HTTP    inkdispatch.com /brother?zenid=8b767c98491178e54d12b4e85ff02b2e 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
90  301 HTTP    inkdispatch.com /brother?zenid=9f0b8cb119fee9a4e276bcae5f13772d 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
91  301 HTTP    inkdispatch.com /brother?zenid=943076fabf058eb1316cfa86aadb1dec 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
92  301 HTTP    inkdispatch.com /brother?zenid=8bd0335032a58b9c399706cd9c695901 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
93  301 HTTP    inkdispatch.com /brother?zenid=a1ba5e21f0af2750d398484e063e8303 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
94  301 HTTP    inkdispatch.com /brother?zenid=e704b2951b1d136c195fd02ad4abec93 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612           
95  301 HTTP    inkdispatch.com /brother?zenid=6d606d0785f19c17ccb1868577a9d546 0   no-store, no-cache, must-revalidate, post-check=0, pre-check=0  Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html   inkdispatchcomscraper.vshost:4612   
  

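Another update

I have noticed that when I open the page in IE, it follows the redirect to /brother. When my code makes the request, however, it gets a new zenid, is forwarded to that URL, and this keeps happening over and over.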

2 answers:

Answer 0 (score: 4)

Set request.AllowAutoRedirect = true;

Edit

For the second problem, declare yummycookies as shown below.

public static string getHTML(string url)
{
   CookieContainer yummycookies = new CookieContainer();
   ...
}
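
To make the suggestion above concrete, here is a minimal sketch. It assumes the `yummycookies` field in the question was never initialized (the question does not show its declaration), so `request.CookieContainer` was null and no cookies were stored or sent. That would explain why the server kept issuing a fresh `zenid` on every redirect. The names `ScraperSketch`, `CreateRequest` and `GetHtml` are hypothetical, and unlike the snippet above this sketch keeps one container in a static field so the cookies survive across calls.

```csharp
using System;
using System.IO;
using System.Net;

public static class ScraperSketch
{
    // One initialized container shared by every request, so a session cookie
    // set by one response is sent back on the next (redirected) request.
    private static readonly CookieContainer Cookies = new CookieContainer();

    // Builds the request without sending it (hypothetical helper).
    public static HttpWebRequest CreateRequest(string url, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.AllowAutoRedirect = true;   // let the framework follow the 301
        request.CookieContainer = cookies;  // must be non-null for cookies to work
        return request;
    }

    // Sends the request and returns the response body as a string.
    public static string GetHtml(string url)
    {
        HttpWebRequest request = CreateRequest(url, Cookies);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling `ScraperSketch.GetHtml("http://inkdispatch.com/brother")` would then follow the redirect with the session cookie attached, provided the site behaves as it did in the question.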

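Answer 1 (score: 0)

When I tested your code it failed, and on another test run I got the following error: "Too many automatic redirections were attempted."

After updating the code and testing again, it worked fine with the URL you provided and the HTML was fetched correctly. The code is here.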

public static string GetHtml2(string urlAddr)
{
    if (string.IsNullOrEmpty(urlAddr))
    {
        throw new ArgumentNullException("urlAddr");
    }

    // 1. Create the request object
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddr);
    //request.AllowAutoRedirect = true;
    //request.MaximumAutomaticRedirections = 200;
    request.Proxy = null;
    request.UseDefaultCredentials = true;

    // 2. Create the container that will hold the cookies
    CookieContainer cc = new CookieContainer();

    // 3. A cookie container must be assigned to the request,
    //    otherwise cookies are neither stored nor sent back
    request.CookieContainer = cc;

    // 4. Read the body; the using blocks dispose the response and the reader
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}

Hope this is OK.
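
If the "Too many automatic redirections were attempted." error comes back, one way to see what is happening without Fiddler is to follow the redirects by hand and record each hop. The sketch below is only an illustration under that assumption: `RedirectTrace`, `ResolveLocation`, `Trace` and the `maxHops` parameter are hypothetical names, not part of either answer.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class RedirectTrace
{
    // Resolves a Location header value (absolute or relative)
    // against the URL that returned it.
    public static Uri ResolveLocation(Uri current, string location)
    {
        return new Uri(current, location);
    }

    // Follows redirects manually, up to maxHops, and returns every URL requested.
    public static List<Uri> Trace(string url, CookieContainer cookies, int maxHops)
    {
        List<Uri> hops = new List<Uri>();
        Uri current = new Uri(url);
        for (int i = 0; i < maxHops; i++)
        {
            hops.Add(current);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(current);
            request.AllowAutoRedirect = false;  // we inspect each 3xx ourselves
            request.CookieContainer = cookies;  // pass null to reproduce the loop
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                int status = (int)response.StatusCode;
                string location = response.Headers["Location"];
                if (status < 300 || status >= 400 || string.IsNullOrEmpty(location))
                {
                    break; // not a redirect: this is the final response
                }
                current = ResolveLocation(current, location);
            }
        }
        return hops;
    }
}
```

Running `Trace` once with a `new CookieContainer()` and once with `null` should show whether the `zenid` loop depends on cookies being sent back, which is what the Fiddler trace in the question suggests.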