Rvest网站抓取错误-识别CSS还是XPath?

时间:2018-12-14 10:35:17

标签: r web-scraping rvest

Rwanda Cooperative Board上有一个数据库;它包含约155页我要访问的数据(无需滚动浏览整个网站)。我在使用R中的rvest包来标识正确的xpath或css时遇到问题。我也在使用selector gadget工具来帮助标识正确的节点。

我的问题是我收到一个'character(0)',暗示我没有在抓取正确的数据。

url <- 'http://www.rca.gov.rw/wemis/registration/all.php?start=0&status=approved'

html <- read_html(url)

rank_data_html <- html_nodes(html, css = '.primary td')

rank_data <- html_text(rank_data_html)

head(rank_data)

有没有一种方法可以更改代码以循环执行并抓取数据?

1 个答案:

答案 0 :(得分:6)

这与使用错误的选择器无关。您要抓取的网站在首次访问时会做一些非常有趣的事情:

enter image description here

点击页面时,它会设置一个cookie,然后刷新页面(这是我见过的强制执行“会话”的最愚蠢的方式之一。)

除非使用代理服务器捕获Web请求,否则即使在浏览器开发人员工具的“网络”选项卡中,也永远无法真正看到此请求。不过,通过查看您最初进行的read_html()调用返回的内容(它只有javascript + redirect),您也可以看到它。

read_html()httr::GET()都无法直接为您提供帮助,因为设置cookie的方式是通过javascript。

但是!希望并没有失去,也不需要像Selenium或Splash这样的愚蠢的第三方要求(令我震惊的是,驻地专家尚未提出建议,因为这似乎是最近的默认响应)。

让我们获取cookie(请确保这是自libcurl开始的FRESH,RESTARTED,NEW R会话,而curl使用的是httr::GET(),而{{ 1}}最终使用-维护cookie(我们将使用此功能来继续抓取页面,但是如果出现任何问题,您可能需要重新开始会话)。

read_html()

现在,我们将设置cookie并获取我们的PHP会话cookie,这两个事件将在之后继续存在:

library(xml2)
library(httr)
library(rvest)
library(janitor)

# Get access cookie

httr::GET(
  url = "http://www.rca.gov.rw/wemis/registration/all.php",
  query = list(
    start = "0",
    status = "approved"
  )
) -> res

ckie <- httr::content(res, as = "text", encoding = "UTF-8")
ckie <- unlist(strsplit(ckie, "\r\n"))
ckie <- grep("cookie", ckie, value = TRUE)
ckie <- gsub("^document.cookie = '_accessKey2=|'$", "", ckie)

现在,有超过400页,因此,如果您抓取了错误并需要重新解析页面,我们将缓存原始HTML。这样,您可以遍历文件然后再次访问该站点。为此,我们将为它们创建一个目录:

httr::GET(
  url = "http://www.rca.gov.rw/wemis/registration/all.php",
  httr::set_cookies(`_accessKey2` = ckie),
  query = list(
    start = "0",
    status = "approved"
  )
) -> res

现在,创建分页开始编号:

dir.create("rca-temp-scrape-dir")

然后,遍历它们。注意:我并不需要全部400多个页面,因此只需要10个页面。删除pgs <- seq(0L, 8920L, 20) 即可全部显示。另外,除非您不想伤害他人,否则请不要睡觉,因为您无需为CPU /带宽付费,而且该站点可能非常脆弱。

[1:10]

最后,我们将这20个数据帧合并为一个:

lapply(pgs[1:10], function(pg) { 

  Sys.sleep(5) # Please don't hammer servers you don't pay for

  httr::GET(
    url = "http://www.rca.gov.rw/wemis/registration/all.php",
    query = list(
      start = pg,
      status = "approved"
    )
  ) -> res

  # YOU SHOULD USE httr FUNCTIONS TO CHECK FOR STATUS
  # SINCE THERE CAN BE HTTR ERRORS THAT YOU MAY NEED TO 
  # HANDLE TO AVOID CRASHING THE ITERATION

  out <- httr::content(res, as = "text", encoding = "UTF-8")

  # THIS CACHES THE RAW HTML SO YOU CAN RE-SCRAPE IT FROM DISK IF NECESSARY  

  writeLines(out, file.path("rca-temp-scrape-dir", sprintf("rca-page-%s.html", pg)))

  out <- xml2::read_html(out)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)

  janitor::clean_names(out) # makes better column names

}) -> recs

如果您是recs <- do.call(rbind.data.frame, recs) str(recs) ## 'data.frame': 200 obs. of 9 variables: ## $ s_no : num 1 2 3 4 5 6 7 8 9 10 ... ## $ code : chr "BUG0416" "RBV0494" "GAS0575" "RSZ0375" ... ## $ name : chr "URUMURI RWA NGERUKA" "BADUKANA IBAKWE NYAKIRIBA" "UBUDASA COOPERATIVE" "KODUKB" ... ## $ certificate: chr "RCA/0734/2018" "RCA/0733/2018" "RCA/0732/2018" "RCA/0731/2018" ... ## $ reg_date : chr "10.12.2018" "-" "10.12.2018" "07.12.2018" ... ## $ province : chr "East" "West" "Mvk" "West" ... ## $ district : chr "Bugesera" "Rubavu" "Gasabo" "Rusizi" ... ## $ sector : chr "Ngeruka" "Nyakiliba" "Remera" "Bweyeye" ... ## $ activity : chr "ubuhinzi (Ibigori, Ibishyimbo)" "ubuhinzi (Imboga)" "transformation (Amasabuni)" "ubworozi (Amafi)" ... 用户,也可以这样做:

tidyverse

vs purrr::map_df(pgs[1:10], ~{ Sys.sleep(5) httr::GET( url = "http://www.rca.gov.rw/wemis/registration/all.php", httr::set_cookies(`_accessKey2` = ckie), query = list( start = .x, status = "approved" ) ) -> res out <- httr::content(res, as = "text", encoding = "UTF-8") writeLines(out, file.path("rca-temp-scrape-dir", sprintf("rca-page-%s.html", pg))) out <- xml2::read_html(out) out <- rvest::html_node(out, "table.primary") out <- rvest::html_table(out, header = TRUE, trim = TRUE) janitor::clean_names(out) }) -> recs 方法。