Question

I want to determine the number of pages in the pagination on this page: https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false

============
Table
============
     Pagination: Link1, Link2, Link3, Link4, LinkNext, LinkLast

Using SelectorGadget, I determined that the pagination is located at ".pagination-container, a".

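I want to:

dump all the links in the pagination into a vector or data.frame
extract the last number from each URL string
find the maximum number, which tells me how many pages the pagination has, so I can use it later in a scraping loop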

Following http://francojc.github.io/web-scraping-with-rvest/

I started with

library(tidyverse)
library(rvest)

url <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"

urls <- url %>% # feed `main.page` to the next step
  html_nodes(".pagination-container, a") %>% # get the CSS nodes
  html_text("href")

This throws an error at html_nodes:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"

What am I doing wrong?

Answer 1

Beyond the "typo" (i.e. the missing call to read_html()), here is an easier way to get the total number of pages. Just target the [>>] link in the paginator:

library(rvest)
library(stringi)
library(tidyverse)

url <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"

pg <- read_html(url)

html_nodes(pg, "li.PagedList-skipToLast > a") %>% 
  html_attr("href") %>% 
  stri_match_last_regex("page=([[:digit:]]+)") %>% 
  .[,2]
## [1] "13"
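
If you would rather follow the plan from the question (collect every pagination link, pull the page number out of each, and take the maximum), something like the sketch below may work. It runs against a small hand-written HTML fragment rather than the live site, so the markup, the `page=` query parameter, and the names `hrefs`, `page_nums`, `n_pages` and `page_urls` are assumptions for illustration only. Two details differ from the original attempt: the selector is ".pagination-container a" (a descendant selector; with a comma it means "the container OR any a element on the page"), and the href is read with html_attr() rather than html_text().

```r
library(rvest)

# Hypothetical pagination fragment standing in for the live page;
# the real markup may differ.
html <- paste0(
  '<div class="pagination-container"><ul class="pagination">',
  '<li><a href="/umowy/Provider/Index?ROK=2017&amp;page=1">1</a></li>',
  '<li><a href="/umowy/Provider/Index?ROK=2017&amp;page=2">2</a></li>',
  '<li><a href="/umowy/Provider/Index?ROK=2017&amp;page=3">3</a></li>',
  '<li class="PagedList-skipToNext">',
  '<a href="/umowy/Provider/Index?ROK=2017&amp;page=2">&gt;</a></li>',
  '<li class="PagedList-skipToLast">',
  '<a href="/umowy/Provider/Index?ROK=2017&amp;page=13">&gt;&gt;</a></li>',
  '</ul></div>'
)

# Parse first: html_nodes() needs a parsed document, not a character string.
pg <- read_html(html)

# 1. Dump all pagination links into a character vector.
hrefs <- html_attr(html_nodes(pg, ".pagination-container a"), "href")

# 2. Keep only links that carry a page number, then extract that number.
hrefs <- hrefs[grepl("page=[0-9]+", hrefs)]
page_nums <- as.integer(sub(".*page=([0-9]+).*", "\\1", hrefs))

# 3. The largest number is the page count.
n_pages <- max(page_nums)

# Build one URL per page for a later scraping loop
# (assumes the site accepts a `page` query parameter).
url <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"
page_urls <- paste0(url, "&page=", seq_len(n_pages))
```

Against the live site you would replace the fragment with pg <- read_html(url). Note that this only works because the "last page" link is present in the paginator; if it were not, the maximum of the visible links would undercount.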

rvest: error getting links from CSS nodes: no applicable method for 'xml_find_all'
