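After I had used the following script for a while, it suddenly stopped working. I built a simple function that finds a table on a web page based on an XPath.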
library(rvest)
url <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08'
find_table <- function(x){read_html(x) %>%
html_nodes(xpath = '//*[@id="center"]/table[2]') %>%
html_table() %>%
as.data.frame()}
table <- find_table(url)
I also tried calling httr::GET before read_html, passing the following argument:
query = list(r_date = "2017-12-22")
But nothing changed. Any ideas?
Answer 0 (score: 0)
Well, that code doesn't work because you're missing a ) on the url <- line.
We'll add httr:
library(httr)
library(rvest)
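url is the name of a base function. Using base function names as variable names makes problems in your code hard to debug. Unless you write perfect code, don't use names like that.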
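As a minimal sketch of why this hurts, assuming a fresh R session: if the assignment to a variable called url fails for any reason, the name still resolves to the base function, and the next step fails with a message that says nothing about the real cause. The snippet below refers to base::url explicitly so that it behaves the same regardless of what is in your workspace.

```r
# `url` already exists in base R before you assign anything to it.
already_taken <- is.function(base::url)

# Treating that function like the character vector you meant to create
# raises an error that does not mention the failed assignment at all.
msg <- tryCatch(
  base::url[1],
  error = function(e) conditionMessage(e)
)
print(msg)
```

With a distinct name such as URL, the same mistake would surface as an "object not found" error, which points straight at the missing assignment.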
URL <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08')
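I don't know whether you're aware of the "rules" of web scraping, but if you're making repeated requests to this site you should use a "crawl delay". They don't have one set in their robots.txt, so 5 seconds is an acceptable choice. I point this out because you may be getting rate-limited.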
find_table <- function(x, crawl_delay = 5) {
  Sys.sleep(crawl_delay) # you can put this in a loop instead if you aren't often doing repeat GETs
  # switch to httr::GET so you can get web server interaction info.
  # since you're scraping, it's expected that you use a custom user agent
  # that also supplies contact info.
  res <- GET(x, user_agent("My scraper"))
  # check to see if there's a non-HTTP 200 response, which there may be
  # if you're getting rate-limited
  stop_for_status(res)
  # now, try to do the parsing. It looks like you're trying to target a
  # single table, so I switched from `html_nodes()` to `html_node()`: the
  # former returns a list of nodes, and the pipe will error out if there's
  # more than one list element.
  content(res, "parsed") %>%
    html_node(xpath = '//*[@id="center"]/table[2]') %>%
    html_table() %>%
    as.data.frame()
}
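If you do end up fetching many pages, the delay can live in the loop rather than inside the function. Here is a rough sketch of that idea. The helper name fetch_all and its arguments are made up for illustration, and the demo call uses nchar as an offline stand-in for a real fetcher so that it runs without touching the network.

```r
# Hypothetical helper: call `fetch` on each URL, pausing between requests.
fetch_all <- function(urls, fetch, crawl_delay = 5) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(crawl_delay)    # no need to wait before the first request
    results[i] <- list(fetch(urls[[i]])) # list() keeps NULL results in place
  }
  names(results) <- urls
  results
}

# Offline demo: nchar() stands in for the real fetcher, with no delay.
demo <- fetch_all(c("a", "bb"), fetch = nchar, crawl_delay = 0)
```

With the function above you would pass something like function(u) find_table(u, crawl_delay = 0) as fetch, so that the pause happens only once per request.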
table is also a base function name (see above).
result <- find_table(URL)
Works fine for me:
str(result)
## 'data.frame': 11 obs. of 5 variables:
## $ ENTI EROGATORI : chr "Cassa DD.PP." "Istituti di previdenza amministrati dal Tesoro" "Istituto per il credito sportivo" "Aziende di credito" ...
## $ : logi NA NA NA NA NA NA ...
## $ ACCENSIONE ACCERTAMENTI : chr "4.638.500,83" "0,00" "0,00" "953.898,47" ...
## $ ACCENSIONE RISCOSSIONI C|COMP. + RESIDUI: chr "2.177.330,12" "0,00" "129.114,22" "848.935,84" ...
## $ RIMBORSO IMPEGNI : chr "438.696,57" "975,07" "45.584,55" "182.897,01" ...
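Note that the amount columns come back as character, since the values use a period as the thousands separator and a comma as the decimal mark. If you need numbers, one possible conversion in base R looks like this. The helper name parse_it_number is hypothetical, and the sketch assumes every value follows that format.

```r
# Hypothetical helper: convert strings such as "4.638.500,83" to numeric.
parse_it_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)  # drop the thousands separators
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(x)
}

amounts <- parse_it_number(c("4.638.500,83", "0,00", "953.898,47"))
```

You could apply it to the relevant columns of result with lapply() once you have checked which ones hold amounts.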