我有一项任务,我想从某个特定的网站下载数据,我检查了httpwebrequest和httpwebresponse的主题,但我仍然无法理解我们如何填写该网站上的文本框并按下登录按钮进入网站内部,登录网站是第一步。
如果有人可以帮助我们,我想用VB.net来完成这项任务。
提前致谢。
答案 0 :(得分:0)
我使用ASP.NET MVC撰写了一篇关于此内容的文章,它可以帮助您:Introduction to Web Scraping with HttpWebRequest using ASP.NET MVC 3
Web Scrapping 意味着解析这些页面以结构化方式提取信息。它还指创建一个程序化界面,一个API,通过一个适合人类的HTML界面与一个站点进行交互。
请不要认为您的程序会输入用户名和密码到相应的文本框中,然后程序会按下登录按钮进入网站。
Web Scraping不会以这种方式发生。
您需要研究网站如何发布登录用户名和密码,然后使用您需要的代码来获取cookie。在每一步,而不是在每个页面上,您需要使用firebug或chrome dev工具查看网站的工作方式,然后相应地发送POST或GET数据以获得您想要的内容。
以下是我从WordPress网站上搜索数据时编写的内容,我在适用的地方添加了注释,以便使代码更易于阅读。
以下是您需要定义的常量,'UserName'和'Pwd'是我的WordPress帐户的登录详细信息,'Url'代表登录页面网址,'ProfileUrl'是页面的地址显示个人资料详情。
const string Url = "http://yassershaikh.com/wp-login.php";
const string UserName = "guest";
const string Pwd = ".netrocks!!"; // n this not my real pwd :P
const string ProfileUrl = "http://yassershaikh.com/wp-admin/profile.php";
public ActionResult Index()
{
string postData = Crawler.PreparePostData(UserName, Pwd, Url);
byte[] data = Crawler.GetEncodedData(postData);
string cookieValue = Crawler.GetCookie(Url, data);
var model = Crawler.GetUserProfile(ProfileUrl, cookieValue);
return View(model);
}
I had created a static class called “Crawler”, here’s the code for it.
// preparing post data
public static string PreparePostData(string userName, string pwd, string url)
{
var postData = new StringBuilder();
postData.Append("log=" + userName);
postData.Append("&");
postData.Append("pwd=" + pwd);
postData.Append("&");
postData.Append("wp-submit=Log+In");
postData.Append("&");
postData.Append("redirect_to=" + url);
postData.Append("&");
postData.Append("testcookie=1");
return postData.ToString();
}
public static byte[] GetEncodedData(string postData)
{
var encoding = new ASCIIEncoding();
byte[] data = encoding.GetBytes(postData);
return data;
}
public static string GetCookie(string url, byte[] data)
{
var webRequest = (HttpWebRequest)WebRequest.Create(url);
webRequest.Method = "POST";
webRequest.ContentType = "application/x-www-form-urlencoded";
webRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2";
webRequest.AllowAutoRedirect = false;
Stream requestStream = webRequest.GetRequestStream();
requestStream.Write(data, 0, data.Length);
requestStream.Close();
var webResponse = (HttpWebResponse)webRequest.GetResponse();
string cookievalue = string.Empty;
if (webResponse.Headers != null && webResponse.Headers["Set-Cookie"] != null)
{
cookievalue = webResponse.Headers["Set-Cookie"];
// Modify CookieValue
cookievalue = GenerateActualCookieValue(cookievalue);
}
return cookievalue;
}
public static string GenerateActualCookieValue(string cookievalue)
{
var seperators = new char[] { ';', ',' };
var oldCookieValues = cookievalue.Split(seperators);
string newCookie = oldCookieValues[2] + ";" + oldCookieValues[0] + ";" + oldCookieValues[8] + ";" + "wp-settings-time-2=1345705901";
return newCookie;
}
public static List<string> GetUserProfile(string profileUrl, string cookieValue)
{
var webRequest = (HttpWebRequest)WebRequest.Create(profileUrl);
webRequest.Method = "GET";
webRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2";
webRequest.AllowAutoRedirect = false;
webRequest.Headers.Add("Cookie", cookieValue);
var responseCsv = (HttpWebResponse)webRequest.GetResponse();
Stream response = responseCsv.GetResponseStream();
var htmlDocument = new HtmlDocument();
htmlDocument.Load(response);
var responseList = new List<string>();
// reading all input tags in the page
var inputs = htmlDocument.DocumentNode.Descendants("input");
foreach (var input in inputs)
{
if (input.Attributes != null)
{
if (input.Attributes["id"] != null && input.Attributes["value"] != null)
{
responseList.Add(input.Attributes["id"].Value + " = " + input.Attributes["value"].Value);
}
}
}
return responseList;
}
希望这有帮助。