Large-scale (~90k pages) XPath scraping

Date: 2017-05-08 23:26:48

Tags: r xpath web-scraping

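I'm looking for an efficient way to scrape a clean XPath value from the Vermont Secretary of State website over several thousand iterations. This is the XPath of the title I'm trying to scrape: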

'//*[@id="content_wrapper"]/div[2]/div/h1'

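I'm struggling to find a clean, efficient way to loop over roughly 90,000 pages, grab the title from each, and store it in a vector. The end goal is to export a small data frame containing the page value and the title found at that XPath. I'll use this data frame to index future searches in a database.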

Here is what I have so far:

library(XML)
library(rvest)

election_value <- 1:90000
title <- NA

for (i in 1:90000) {
  url <- sprintf("http://vtelectionarchive.sec.state.vt.us/elections/view/%s", election_value[i])
  if (is.null(tryCatch({read_html(url) %>% html_nodes(xpath='//*[@id="content_wrapper"]/div[2]/div/h1')  %>% html_text()}, error=function(e){}))) {
    title[i] <- NA } else {
      title[i] <- read_html(url) %>% html_nodes(xpath='//*[@id="content_wrapper"]/div[2]/div/h1')}
}
vermont_titles <- data.frame(election_value, title)
write.csv(vermont_titles, 'vermont_titles.csv')

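Unfortunately, the script doesn't work because html_nodes() returns a string with brackets rather than just the text. Any solution would be appreciated, as this script has been bugging me for a week or so.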
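To illustrate the symptom described above: html_nodes() returns a node set (which prints with its markup), while html_text() pulls out just the text, and the else branch of the script never calls html_text(). The sketch below uses a small inline HTML string as a stand-in for one election page; that markup is an assumption made up for illustration, not the real page.

```r
library(rvest)

# Hypothetical stand-in for one election page (assumed structure, not the real site)
page <- read_html(paste0(
  '<html><body><div id="content_wrapper"><div></div>',
  '<div><div><h1>2014 Probate Judge General Election Rutland County</h1></div></div>',
  '</div></body></html>'
))

title_xpath <- '//*[@id="content_wrapper"]/div[2]/div/h1'

# html_nodes() gives back an xml_nodeset, not a character vector
nodes <- html_nodes(page, xpath = title_xpath)
print(class(nodes))

# html_text() converts the matched nodes to plain text
title_text <- html_text(nodes)
print(title_text)
```

Storing title_text rather than nodes is what lets the result fit into an ordinary character vector.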

2 answers:

Answer 0 (score: 2)

Here is a working solution. See the comments for additional details:

library(rvest)

#url<-"http://vtelectionarchive.sec.state.vt.us/elections/view/68156"
election_value <- 68150:68199

#predefine title vector
title <- vector("character", length=length(election_value))

for (i in seq_along(election_value)) {
  url <- sprintf("http://vtelectionarchive.sec.state.vt.us/elections/view/%s", election_value[i])
  #read the page once; tryCatch returns NULL if the request fails
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) {
    title[i] <- NA
  } else {
    #parse the page and extract the title as text
    node <- page %>% html_nodes(xpath = '//*[@id="content_wrapper"]/div[2]/div/h1')
    #guard against pages where the xpath matches nothing
    title[i] <- if (length(node) > 0) html_text(node)[1] else NA
  }
}
vermont_titles <- data.frame(election_value, title)
write.csv(vermont_titles, 'vermont_titles.csv')

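A few notes: reading each page once instead of twice, and parsing it only once, improves performance. Predefining title as a vector of the final length is another performance gain.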
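The two notes above (one read per page, a preallocated result vector) can be sketched as a small function. This is only a sketch: scrape_titles() and fake_fetch() are hypothetical names introduced here, and the fetch argument exists solely so the loop can be tried offline against made-up HTML before pointing it at the real site with the default read_html.

```r
library(rvest)

title_xpath <- '//*[@id="content_wrapper"]/div[2]/div/h1'

# Hypothetical helper: `fetch` defaults to read_html but can be swapped for testing
scrape_titles <- function(ids, fetch = read_html) {
  base_url <- "http://vtelectionarchive.sec.state.vt.us/elections/view/"
  titles <- rep(NA_character_, length(ids))  # preallocate the full result
  for (i in seq_along(ids)) {
    # fetch and parse each page exactly once; NULL signals a failed request
    page <- tryCatch(fetch(paste0(base_url, ids[i])), error = function(e) NULL)
    if (is.null(page)) next
    # html_node() (singular) yields a missing node, hence NA text, when nothing matches
    titles[i] <- html_text(html_node(page, xpath = title_xpath))
  }
  data.frame(election_value = ids, title = titles, stringsAsFactors = FALSE)
}

# Offline stand-in for the site (assumed markup): id 3 fails, others return a title
fake_fetch <- function(url) {
  if (grepl("/3$", url)) stop("page not found")
  id <- sub(".*/", "", url)
  read_html(sprintf(
    '<div id="content_wrapper"><div></div><div><div><h1>Election %s</h1></div></div></div>',
    id
  ))
}

result <- scrape_titles(c(3L, 68150L), fetch = fake_fetch)
print(result)
```

Passing integer ids keeps paste0() from formatting large values in scientific notation, which would otherwise produce malformed URLs.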

Answer 1 (score: 2)

Another solution could be:

require(tidyverse)
require(rvest)
election_value <- c(3,68150:68153)
base_url <- "http://vtelectionarchive.sec.state.vt.us/elections/view/"
urls <- paste0(base_url, election_value)

map(urls, possibly(read_html, NA_character_)) %>% 
  map_if(negate(is.na), html_nodes, xpath = '//*[@id="content_wrapper"]/div[2]/div/h1') %>% 
  map_if(negate(is.na), html_text) %>% 
  as.character %>% 
  tibble(election_value, title = .)

Returns:

# A tibble: 5 × 2
  election_value                                                 title
           <dbl>                                                 <chr>
1              3                                                    NA
2          68150    2014 Probate Judge General Election Rutland County
3          68151    2014 Probate Judge General Election Orleans County
4          68152 2014 Probate Judge General Election Grand Isle County
5          68153   2014 Probate Judge General Election Lamoille County
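A possible variant of the same idea, offered as a sketch rather than a drop-in replacement: wrap the whole fetch-and-extract step in possibly() so that each input yields exactly one string, and collect the results with map_chr(). The names get_title() and safe_title() are hypothetical, and the inputs are offline stand-ins (a nonexistent file to force an error, plus made-up HTML) instead of real URLs.

```r
library(purrr)
library(rvest)

title_xpath <- '//*[@id="content_wrapper"]/div[2]/div/h1'

# Hypothetical helper: parse one source and return its title as a single string
get_title <- function(src) {
  html_text(html_node(read_html(src), xpath = title_xpath))
}

# possibly() turns any error inside get_title() into NA_character_
safe_title <- possibly(get_title, otherwise = NA_character_)

# Offline stand-ins: the first entry is a file that does not exist, so reading it fails
sources <- c(
  "no_such_page.html",
  '<div id="content_wrapper"><div></div><div><div><h1>2014 Probate Judge General Election Rutland County</h1></div></div></div>'
)

titles <- map_chr(sources, safe_title)
print(titles)
```

With real data, sources would be the urls vector built above, and the resulting character vector could go straight into tibble(election_value, title = titles).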