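I'm looking for an efficient way to scrape the cleaned text at a single XPath from the Vermont Secretary of State website, repeated over many thousands of pages. This is the XPath of the title I'm trying to scrape: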
'//*[@id="content_wrapper"]/div[2]/div/h1'
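I'm struggling to find a clean, efficient way to loop over roughly 90,000 pages, grab the title from each, and store it in a vector. The end goal is to export a small data frame containing the page value and the title found at that XPath. I'll use this data frame to index future searches in a database.
Here's what I have so far: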
library(XML)
library(rvest)
election_value <- 1:90000
title <- NA
for (i in 1:90000) {
url <- sprintf("http://vtelectionarchive.sec.state.vt.us/elections/view/%s", election_value[i])
if (is.null(tryCatch({read_html(url) %>% html_nodes(xpath='//*[@id="content_wrapper"]/div[2]/div/h1') %>% html_text()}, error=function(e){}))) {
title[i] <- NA } else {
title[i] <- read_html(url) %>% html_nodes(xpath='//*[@id="content_wrapper"]/div[2]/div/h1')}
}
vermont_titles <- data.frame(election_value, title)
write.csv(vermont_titles, 'vermont_titles.csv')
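Unfortunately the script doesn't work, because html_nodes() returns the node wrapped in its tag brackets rather than just the text. Any solution would be appreciated; this script has been bugging me for about a week.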
Answer 0 (score: 2)
Here is a working solution. See the comments for additional details:
library(rvest)
#url<-"http://vtelectionarchive.sec.state.vt.us/elections/view/68156"
election_value <- 68150:68199
#predefine title vector
title <- vector("character", length=length(election_value))
#loop over every election value instead of a hard-coded 1:50
for (i in seq_along(election_value)) {
url <- sprintf("http://vtelectionarchive.sec.state.vt.us/elections/view/%s", election_value[i])
#read page and test if null
page<-tryCatch({read_html(url)}, error=function(e){})
if (is.null(page))
{
title[i] <- NA }
else {
#parse the page and extract the title as text; store NA if the node is missing
node<-page %>% html_nodes(xpath='//*[@id="content_wrapper"]/div[2]/div/h1')
title[i] <- if (length(node) > 0) html_text(node[[1]]) else NA
}
}
vermont_titles <- data.frame(election_value, title)
write.csv(vermont_titles, 'vermont_titles.csv')
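A few notes: reading each page once instead of twice, and parsing it only once, improves performance. Predefining the title vector is another performance gain.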
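To illustrate the difference between the node and its text, here is a minimal sketch that runs offline. The helper name `extract_title` and the inline sample HTML are assumptions made for illustration; they are not part of the site or of the original script. The helper returns NA both when the page could not be read and when the XPath matches nothing.

```r
library(rvest)

# Hypothetical helper (an assumption, not from the original script):
# returns the title text, or NA when the page is NULL or the node is absent.
extract_title <- function(page,
                          xpath = '//*[@id="content_wrapper"]/div[2]/div/h1') {
  if (is.null(page)) return(NA_character_)
  node <- html_nodes(page, xpath = xpath)
  if (length(node) == 0) return(NA_character_)
  # html_text() strips the tags; trim = TRUE drops surrounding whitespace
  html_text(node[[1]], trim = TRUE)
}

# Inline stand-in for a real page, shaped to match the XPath above
sample_page <- read_html(paste0(
  '<html><body><div id="content_wrapper"><div></div>',
  '<div><div><h1> 2014 Probate Judge General Election Rutland County </h1>',
  '</div></div></div></body></html>'
))

# A page with no matching node
empty_page <- read_html("<html><body><p>nothing here</p></body></html>")

titles <- c(extract_title(sample_page),
            extract_title(empty_page),
            extract_title(NULL))
print(titles)
```

Inside the loop, the same idea would be `title[i] <- extract_title(page)`, which keeps the NA handling in one place.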
Answer 1 (score: 2)
Another solution could be:
require(tidyverse)
require(rvest)
election_value <- c(3,68150:68153)
base_url <- "http://vtelectionarchive.sec.state.vt.us/elections/view/"
urls <- paste0(base_url, election_value)
map(urls, possibly(read_html, NA_character_)) %>%
map_if(negate(is.na), html_nodes, xpath = '//*[@id="content_wrapper"]/div[2]/div/h1') %>%
map_if(negate(is.na), html_text) %>%
as.character %>%
tibble(election_value, title = .)
Which returns:
# A tibble: 5 × 2
  election_value title
           <dbl> <chr>
1              3 NA
2          68150 2014 Probate Judge General Election Rutland County
3          68151 2014 Probate Judge General Election Orleans County
4          68152 2014 Probate Judge General Election Grand Isle County
5          68153 2014 Probate Judge General Election Lamoille County
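The first row is NA because `possibly()` swallowed the read error for that page. As a small offline sketch of that mechanism (the file name and the sample HTML below are made up for illustration), wrapping `read_html` this way returns the fallback value instead of stopping the whole pipeline:

```r
library(purrr)
library(rvest)

# possibly() wraps a function so that an error yields the `otherwise` value
safe_read <- possibly(read_html, otherwise = NA_character_)

# A path that does not exist makes read_html() fail;
# the wrapped version returns NA instead of raising an error
missing <- safe_read("no-such-file-for-this-sketch.html")

# Literal HTML parses normally through the same wrapper
ok <- safe_read("<html><body><h1>Sample title</h1></body></html>")
ok_title <- html_text(html_nodes(ok, xpath = "//h1"))

print(is.na(missing))
print(ok_title)
```

This is why the later `map_if(negate(is.na), ...)` steps can skip the failed entries and leave them as NA.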