Web scraping a password-protected website, but getting errors

Time: 2016-08-09 06:54:50

Tags: r web-scraping password-protection

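I am trying to scrape data from a website's member directory ("members.dublinchamber.ie"). I tried using 'rvest', but even after supplying the login details I still get the data from the login page. The code is as follows: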

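library(rvest)
# Include the scheme so the URL resolves
url <- "http://members.dublinchamber.ie/login.aspx"
pgsession <- html_session(url)
# The second form on the page is taken to be the login form
pgform <- html_form(pgsession)[[2]]
filled_form <- set_values(pgform,
                      "Username" = "username",
                      "Password" = "password")
# Keep the session returned by submit_form() so later requests use the logged-in state
pgsession <- submit_form(pgsession, filled_form)
memberlist <- jump_to(pgsession, 'http://members.dublinchamber.ie/directory/profile.aspx?compid=50333')
page <- read_html(memberlist)
# 'css of required data' is a placeholder for the real CSS selector
usernames <- html_nodes(x = page, css = 'css of required data')
data_usernames <- data.frame(username = html_text(usernames, trim = TRUE), stringsAsFactors = FALSE)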

I also tried RCurl, and again I get the data from the login page. The RCurl code is as follows:

library(RCurl)
# One handle reused for every request so the session cookie persists
curl <- getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('http://members.dublinchamber.ie/login.aspx', curl = curl)
# Capture the hidden __VIEWSTATE value (base64 characters) from the login page
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
  'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$username'= 'username',
  'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$password'= 'pass',
  'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$btnSubmit'= 'login',
  '__VIEWSTATE'                         = viewstate
)
html <- postForm('http://members.dublinchamber.ie/login.aspx', .params = params, curl = curl)
# A non-empty result means the returned page contains 'Logout'
grep('Logout', html)
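
ASP.NET WebForms pages often carry other hidden fields next to __VIEWSTATE (for example __EVENTVALIDATION), and a POST that leaves them out can be bounced back to the login page. Whether this site requires them is an assumption. A minimal base-R sketch of pulling such fields out of the page text follows; the helper name extract_hidden_field and the sample markup are hypothetical, and the pattern assumes the id attribute comes before value inside the tag.

```r
# Hypothetical helper: return the value of a hidden <input> identified by its id,
# or NA if it is not found. Assumes id="..." appears before value="..." in the tag.
extract_hidden_field <- function(html, field) {
  pattern <- paste0('id="', field, '"[^>]*value="([^"]*)"')
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up markup standing in for the real login page
sample_html <- paste0(
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK+abc=" />',
  '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAg==" />'
)

viewstate <- extract_hidden_field(sample_html, "__VIEWSTATE")
eventvalidation <- extract_hidden_field(sample_html, "__EVENTVALIDATION")
```

If the real page does contain such fields, each extracted value could be added to the params list under its own name before calling postForm.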

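There are actually three URLs:
1) members.dublinchamber.ie/directory/default.aspx (lists the names of all industries; one of them has to be clicked)
2) members.dublinchamber.ie/directory/default.aspx?industryVal=AdvMarPubrel (AdvMarPubrel is just a short string generated when I click that industry)
3) members.dublinchamber.ie/directory/profile.aspx?compid=19399 (shows the profile information of the specific company I clicked on the previous page)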
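
The three URLs above differ only in their query strings, so the industry code and company id can be read from the links themselves. A small base-R sketch, where the helper name query_param is hypothetical and the links vector simply reuses the three URLs from the question:

```r
# Hypothetical helper: return the value of one query parameter for each URL,
# or NA where the parameter is absent.
query_param <- function(urls, name) {
  pattern <- paste0("[?&]", name, "=([^&#]*)")
  vapply(urls, function(u) {
    m <- regmatches(u, regexec(pattern, u))[[1]]
    if (length(m) < 2) NA_character_ else m[2]
  }, character(1), USE.NAMES = FALSE)
}

links <- c(
  "members.dublinchamber.ie/directory/default.aspx",
  "members.dublinchamber.ie/directory/default.aspx?industryVal=AdvMarPubrel",
  "members.dublinchamber.ie/directory/profile.aspx?compid=19399"
)

industries <- query_param(links, "industryVal")
company_ids <- query_param(links, "compid")
```

On a real page the links vector would come from the href attributes of the anchors on the industry and company listing pages.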

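I want to scrape data that gives me the industry names, the list of companies in each industry, and the details shown as a table at the third URL above. I am new here and new to R and web scraping. Please bear with me if the question is long or unclear.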
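
The goal described above amounts to a nested loop: industries, then companies per industry, then one profile per company, collected into a single table. The sketch below shows only that structure. The function crawl_directory and its three arguments are hypothetical, the stand-in functions return made-up data (including the "Legal" code), and in practice each would wrap a request made with the logged-in session.

```r
# Sketch of the three-level crawl. The caller supplies three functions:
#   list_industries()     -> character vector of industry codes
#   list_companies(code)  -> character vector of company ids for that industry
#   get_profile(id)       -> one-row data.frame of profile details
crawl_directory <- function(list_industries, list_companies, get_profile) {
  rows <- list()
  for (industry in list_industries()) {
    for (id in list_companies(industry)) {
      profile <- get_profile(id)
      # Prefix each profile row with the industry and company id it came from
      rows[[length(rows) + 1]] <- cbind(
        data.frame(industry = industry, compid = id, stringsAsFactors = FALSE),
        profile
      )
    }
  }
  # Stack all rows into one data frame (NULL if nothing was collected)
  do.call(rbind, rows)
}

# Stand-in functions with made-up data, used here instead of real requests
demo <- crawl_directory(
  list_industries = function() c("AdvMarPubrel", "Legal"),
  list_companies = function(code) if (code == "Legal") "3" else c("1", "2"),
  get_profile = function(id) {
    data.frame(name = paste("Company", id), stringsAsFactors = FALSE)
  }
)
```

Passing the page-fetching steps in as functions keeps the loop independent of whether rvest or RCurl does the requests.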

0 answers:

No answers