rvest vs RSelenium结果用于文本提取

时间:2019-07-02 17:35:30

标签: r rvest rselenium

到目前为止,我正在使用RSelenium来提取主页的文本,但是我想切换到rvest之类的快速解决方案。

library(rvest)
url = 'https://www.r-bloggers.com'
rvestResults <- read_html(url) %>%
  html_node('body') %>%
  html_text()

library(RSelenium)
remDr$navigate(url)
rSelResults <- remDr$findElement(
  using = "xpath",
  value = "//body"
)$getElementText()

比较以下结果表明rvest包含一些JavaScript代码,而 RSelenium更“干净”。

我知道rvest和rselenium之间的区别,rselenium使用无头浏览器,而rvest只是读取“普通主页”。

我的问题是:有没有办法让rvest的Rselenium Output低于或以第三种方式比rvest更快(或更快)?

投资结果:

> substring(rvestResults, 1, 500)
[1] "\n\n\n\t\t    \t    \t\n        \n        R news and tutorials contributed by (750) R bloggers         \n    Home\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\nSubmit a new job (it’s free)\n\tBrowse latest jobs (also free)\n\nContact us\n\n\n\n\n\n\n\n    \n\t\tWelcome!
     \t\t\t\r\nfunction init() {\r\nvar vidDefer = document.getElementsByTagName('iframe');\r\nfor (var i=0; i<vidDefer.length; i++) {\r\nif(vidDefer[i].getAttribute('data-src')) 
     {\r\nvidDefer[i].setAttribute('src',vidDefer[i].getAttribute('data-src'));\r\n} } }\r\nwindow.onload = i"

硒结果:

> substring(rSelResults, 1, 500)
[1] "R news and tutorials contributed by (750) R bloggers\nHome\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\n�\n�\n�\nContact us\nWELCOME!\nHere you will find daily news and tutorials about R, 
     contributed by over 750 bloggers.\nThere are many ways to follow us -\nBy e-mail:\nOn Facebook:\nIf you are an R blogger yourself you are invited to add your own R content feed to this site (Non-English 
     R bloggers should add themselves- here)\nJOBS FOR R-USERS\nData/GIS Analyst for Ecoscape Environmental Consultants @ Kelowna, "

2 个答案:

答案 0 :(得分:1)

也许webdriver(是PhantomJS实现)会做得更好(目前无法针对RSelenium进行测试):

library("webdriver")
library("rvest")

pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
url <- 'https://www.r-bloggers.com'
ses$go(url)

res <- ses$getSource() %>% 
  read_html() %>%
  html_node('body') %>%
  html_text()

substring(res, 1, 500)
#> [1] "\n\n\n\t\t    \t    \t\n        \n        R news and tutorials contributed by (750) R bloggers         \n    Home\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\nSubmit a new job (it’s free)\n\tBrowse latest jobs (also free)\n\nContact us\n\n\n\n\n\n\n\n    \n\t\tWelcome!\t\t\t\n\n\n\n\nHere you will find daily news and tutorials about R, contributed by over 750 bloggers. \n\nThere are many ways to follow us - \nBy e-mail:\n\n\n<img src=\"https://feeds.feedburner.com/~fc/RBloggers?bg=99CCFF&amp;fg=444444&amp;anim=0\" height=\"26\" width=\"88\" sty"

答案 1 :(得分:0)

您可以尝试使用正则表达式清理数据,

url <- "https://www.r-bloggers.com"

res <- url %>% 
  read_html() %>% 
  html_nodes('body') %>%
  html_text()

library(stringr)

# clean up text data
res %>%
  str_replace_all(pattern = "\n", replacement = " ") %>%
  str_replace_all(pattern = "[\\^]", replacement = " ") %>%
  str_replace_all(pattern = "\"", replacement = " ") %>%
  str_replace_all(pattern = "\\s+", replacement = " ") %>%
  str_trim(side = "both")