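I'm trying to scrape a website, but I can't get past the login using the .NET HttpWebRequest and HttpWebResponse classes. Looking at a network monitor, one key difference seems to be that the browser's login includes the payload in the POST message, whereas HttpWebRequest sends the payload in a separate message and gets a 301 response. Is there a way to make it use a single message? Or is there something else I'm missing? I've used this code successfully on another website: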
// Set GET to logon site.
SiteRequest = (HttpWebRequest)WebRequest.Create(logonUrl);
SiteRequest.Method = "GET";
SiteRequest.AllowAutoRedirect = AllowRedirect;
SiteRequest.CookieContainer = SiteCookieContainer;
SiteRequest.Referer = logonUrl;
SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
mainStream = SiteResponse.GetResponseStream();
ReadAndIgnoreAllStreamBytes(mainStream);
mainStream.Close();
// Send POST to logon site.
SiteRequest = (HttpWebRequest)WebRequest.Create(postUrl);
SiteRequest.Method = "POST";
SiteRequest.AllowAutoRedirect = AllowRedirect;
SiteRequest.ContentType = "application/x-www-form-urlencoded";
SiteRequest.CookieContainer = SiteCookieContainer;
SiteRequest.CookieContainer.Add(SiteResponse.Cookies);
SiteRequest.Referer = postUrl;
SiteRequest.Timeout = TimeoutMsec;
buffer = Encoding.UTF8.GetBytes(logonPostData);
SiteRequest.ContentLength = buffer.Length;
postStream = SiteRequest.GetRequestStream();
postStream.Write(buffer, 0, buffer.Length);
postStream.Flush();
postStream.Close();
SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
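One thing that may be worth ruling out: on a POST, HttpWebRequest can by default add an Expect: 100-continue header and send the headers first, with the body following once the server answers or a short wait elapses. A network monitor could show that as a separate payload message. Below is a minimal sketch of switching that handshake off for a single request. It assumes the handshake is what splits the message, which I have not confirmed, and the names LogonRequestSketch and CreatePost are made up for illustration.

```csharp
using System;
using System.Net;

static class LogonRequestSketch
{
    // Hypothetical helper: builds the POST request with the
    // Expect: 100-continue handshake disabled for its ServicePoint.
    // Nothing is sent until GetRequestStream() or GetResponse() is called.
    public static HttpWebRequest CreatePost(string postUrl, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(postUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        // Assumption: the split payload comes from the 100-continue handshake.
        request.ServicePoint.Expect100Continue = false;
        return request;
    }
}
```

The returned request would then be used exactly like SiteRequest above: set ContentLength, write the buffer to GetRequestStream(), and call GetResponse().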
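I have the same problem using the HtmlWeb class in HtmlAgilityPack.
Thanks.
Update
It turns out I was using the "www.example.com" form of the address rather than "example.com", hence the redirect. But with the correct address I now get a 404 "page not found" error.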
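To keep the "www." form from slipping back into either logonUrl or postUrl, both could be passed through a small normalizing helper before the requests are created. This is only a sketch: the names UrlSketch and StripWww are hypothetical, and it assumes the site really does serve the login from the bare host.

```csharp
using System;

static class UrlSketch
{
    // Hypothetical helper: returns the same URL with a leading "www."
    // removed from the host, leaving scheme, path and query untouched.
    public static Uri StripWww(Uri uri)
    {
        var builder = new UriBuilder(uri);
        if (builder.Host.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
        {
            builder.Host = builder.Host.Substring(4);
        }
        return builder.Uri;
    }
}
```

For example, UrlSketch.StripWww(new Uri(postUrl)) would be passed to WebRequest.Create in place of postUrl.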
Here is what the browser sends for the POST:
- Http: Request, POST /accounts/signin
Command: POST
+ URI: /accounts/signin
ProtocolVersion: HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: http://***.com/accounts/signin
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
UserAgent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; Touch)
+ ContentType: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
Host: example.com
ContentLength: 67
DNT: 1
Connection: Keep-Alive
Cache-Control: no-cache
- Cookie: PHPSESSID=169***efe; lang=en_US; cart=eyJ***wfQ%3D%3D; cartitems=W10%3D; __utma=***; __utmb=***; __utmc=**; __utmz=**
PHPSESSID: 169***efe
lang: en_US
cart: eyJ***wfQ%3D%3D
cartitems: W10%3D
__utma: ***
__utmb: ***
__utmc: ***
__utmz: ***
HeaderEnd: CRLF
- payload: HttpContentType = application/x-www-form-urlencoded
url:
email: ***
password: ***
Here is what I'm sending:
(POST:)
- Http: Request, POST /accounts/signin
Command: POST
+ URI: /accounts/signin
ProtocolVersion: HTTP/1.1
+ ContentType: application/x-www-form-urlencoded
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
Accept-Encoding: gzip, deflate
DNT: 1
Cache-Control: no-cache
Referer: http://***.com/accounts/signin
Host: chinesepod.com
- Cookie: lang=en_US; cart=eyJ***jowfQ%3D%3D; cartitems=W10%3D; PHPSESSID=944***3e7
lang: en_US
cart: eyJ***wfQ%3D%3D
cartitems: W10%3D
PHPSESSID: 944***3e7
ContentLength: 61
HeaderEnd: CRLF
(Separate payload:)
- Http: HTTP Payload, URL: /accounts/signin
- payload: HttpContentType = application/x-www-form-urlencoded
url:
email: ***
password: ***
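The browser version has these __utXX cookies, which I assume the browser adds as some kind of tracking marker, right? Otherwise, assuming cookie order doesn't matter, the key difference is that the payload is sent separately. Does anyone see anything else?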
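Looking at the two captures again, the ContentLength also differs: 67 from the browser versus 61 from my code. One possible explanation, which is only an assumption on my part, is that the browser percent-encodes characters in the form values (for example "@" becomes "%40") while my logonPostData string does not. Below is a sketch of building the body with each value encoded. FormBodySketch, Build and ByteLength are hypothetical names, and the field names url, email and password are taken from the payload shown above.

```csharp
using System;
using System.Text;

static class FormBodySketch
{
    // Builds an application/x-www-form-urlencoded body, percent-encoding
    // each value so reserved characters such as '@' are escaped.
    public static string Build(string url, string email, string password)
    {
        return "url=" + Uri.EscapeDataString(url)
             + "&email=" + Uri.EscapeDataString(email)
             + "&password=" + Uri.EscapeDataString(password);
    }

    // Number of bytes the body occupies as UTF-8, i.e. the value
    // that would be assigned to ContentLength.
    public static int ByteLength(string body)
    {
        return Encoding.UTF8.GetByteCount(body);
    }
}
```

Comparing ByteLength of the encoded body with the 67 reported in the browser capture would show whether encoding accounts for the gap.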
Thanks.
-John