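How do I go from URLs stored in RData to scraping with RSelenium? (password-protected website)

Time: 2019-05-06 11:33:33

Tags: r web-scraping rselenium

I have a set of URLs for scraping a newspaper. The URLs are stored in an RData file. I am trying to scrape news from http://politiken.dk/arkiv/, a website that requires a login name and password, which I have.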
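For context, here is a minimal sketch of how the URLs could be read from such a file. The file name `politiken_urls.RData` and the stand-in data are assumptions made only so that the sketch runs on its own; `politiken.unique` and its `urls` column are the names used in the code further down.

```r
# Stand-in data so the sketch is self-contained; the real file and its
# contents are not shown here.
politiken.unique <- data.frame(
  urls = c("https://politiken.dk/example-1", "https://politiken.dk/example-2"),
  stringsAsFactors = FALSE
)
rdata_path <- file.path(tempdir(), "politiken_urls.RData")  # hypothetical file name
save(politiken.unique, file = rdata_path)
rm(politiken.unique)

# load() restores objects under the names they were saved with and returns
# those names; loading into a fresh environment avoids overwriting anything
# that is already in the workspace.
env <- new.env()
loaded <- load(rdata_path, envir = env)
urls <- as.character(env$politiken.unique$urls)
```

Inspecting `loaded` shows which objects the file actually contains, which helps when the object name is not known in advance.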

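I wrote code to log in to the website, and it works.

Now I need to get the text of each news article from its page. With the URLs and ordinary code this would be fine if no password were required. Because the site requires a login, that approach does not work, so I think I have to use RSelenium to get the text behind each URL.

This would be the code without RSelenium: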

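library(rvest)  # read_html(), html_nodes(), html_text() and the %>% pipe

# One (initially empty) string per URL
headlines <- rep("", nrow(politiken.unique))
for (i in seq_len(nrow(politiken.unique))) {
  # try() lets the loop continue when a page fails to load
  try({
    text <- read_html(as.character(politiken.unique$urls[i])) %>%
      html_nodes(".summary__p") %>%
      html_text(trim = TRUE)
    headlines[i] <- paste(text, collapse = " ")
  })
}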

But obviously this does not work with RSelenium.

So far I have this (the login on the website):

library(RSelenium)

# Log in to the website
url <- "https://medielogin.dk/politiken/login?redirect=%2Fopenid%2Fendpoint%3Fopenid.ns%3Dhttp%3A%252F%252Fspecs.openid.net%252Fauth%252F2.0%26openid.claimed_id%3Dhttp%3A%252F%252Fspecs.openid.net%252Fauth%252F2.0%252Fidentifier_select%26openid.identity%3Dhttp%3A%252F%252Fspecs.openid.net%252Fauth%252F2.0%252Fidentifier_select%26openid.return_to%3Dhttps%3A%252F%252Fpolitiken.dk%252F%253Fpolid_return%253D1556061648%26openid.realm%3Dhttps%3A%252F%252Fpolitiken.dk%26openid.assoc_handle%3D7FNp!IAAAAJOSsCUfDPIhEzFBywNx1aXHKOZanVsMLPzmtapZJI3tQQAAAAEvGB5AgUqaWQPLeSFCYZf9FrsoqDOLz1jwhFWSebEvBo2JaUdfcjULF5tkWHI4GDSYH04oXa8S0roaQVQuJMwA%26openid.mode%3Dcheckid_setup%26openid.ns.ext1%3Dhttp%3A%252F%252Fopenid.net%252Fsrv%252Fax%252F1.0%26openid.ext1.brand%3Dpolitiken"

# Start a Selenium server with a Chrome client, then open the login page
rd <- rsDriver(browser = "chrome", chromever = "74.0.3729.6")
driver <- rd[["client"]]
driver$navigate(url)

# Fill in the username ("email" is a placeholder)
user <- driver$findElement(using = "css selector", "input#Username")
user$sendKeysToElement(list("email"))

# Fill in the password ("password" is a placeholder)
pass <- driver$findElement(using = "css selector", "input#Password")
pass$sendKeysToElement(list("password"))

# Submit the login form
login <- driver$findElement(using = "css selector", "button.ml-submit")
login$clickElement()

How can I use RSelenium to get the text from each of the URLs on the website?
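One possible direction, as an untested sketch rather than a confirmed solution: keep the logged-in browser session, navigate to each URL with the driver, and hand the rendered page source to rvest. The helper names `extract_summary` and `scrape_with_driver` and the `wait` pause are hypothetical; the `.summary__p` selector is the one from the rvest code above, and whether it matches the pages behind the login is an assumption.

```r
library(rvest)  # read_html(), html_nodes(), html_text() and the %>% pipe

# Parse the HTML source of one page and return the summary text as one string.
# Returns "" when no ".summary__p" node is found.
extract_summary <- function(page_source) {
  text <- read_html(page_source) %>%
    html_nodes(".summary__p") %>%
    html_text(trim = TRUE)
  paste(text, collapse = " ")
}

# Visit each URL in the already logged-in browser and collect the text.
# `driver` is the RSelenium client from the login code; `wait` (in seconds)
# is a hypothetical pause that gives each page time to load.
scrape_with_driver <- function(driver, urls, wait = 2) {
  headlines <- rep("", length(urls))
  for (i in seq_along(urls)) {
    # try() lets the loop continue when a single page fails
    try({
      driver$navigate(urls[i])
      Sys.sleep(wait)
      # getPageSource() returns a list; the HTML is its first element
      headlines[i] <- extract_summary(driver$getPageSource()[[1]])
    })
  }
  headlines
}

# Intended use after the login code has run:
# headlines <- scrape_with_driver(driver, as.character(politiken.unique$urls))
```

Because the browser keeps the session cookies from the login, pages opened through `driver$navigate()` should be served as they are to a logged-in user, which `read_html()` on the bare URL cannot do.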
