我正在尝试在给定的区域和时间(例如“ Onslow”,“ Yesterday”)上抓取该网页(https://nc.211counts.org)。我想从该左上表中提取所有信息(COVID,房屋等通过其他)。不幸的是,选择过滤器后,URL不会更新。我一直在遵循教程here,但是找不到找到要抓取的区域名称位置的方法。由于html_nodes
函数返回的是空值,因此我认为映射已关闭。
我在这里想念什么?
# docker run -d -p 4445:4444 selenium/standalone-chrome
# docker ps
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
port = 4445L,
browserName = "chrome")
remDr$open()
remDr$navigate("https://nc.211counts.org")
remDr$screenshot(display = TRUE)
nc211 <- xml2::read_html(remDr$getPageSource()[[1]])
str(nc211)
body_nodes <- nc211 %>%
html_node('body') %>%
html_children()
body_nodes
body_nodes %>%
html_children()
rank <- nc211 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//span[contains(@class, 'col-lg-12 chosen-select')]") %>%
rvest::html_text()
# this returns empty
nc211 %>%
rvest::html_nodes("#region") %>%
rvest::html_children() %>%
rvest::html_text()
# guessing at an element number to see what happens
element<- remDr$findElement(using = 'css selector', "#region > option:nth-child(1)")
element$clickElement()
答案 0 :(得分:1)
选择内容并按“搜索”时,内容会通过xhr POST请求动态更新。您可以使用“网络”标签来分析这些请求并重现它们,而无需求助于硒(作为替代方法)。您将需要从初始页面选择参数选项。
下面,我向您展示如何请求特定的邮政编码,以及如何找出所有邮政编码及其在请求中使用的相应参数ID。后者需要来自初始网址。
library(httr)
library(rvest)
data = list(
'id' = '{"ids":["315"]}', # zip 27006 is id 315 seen in value attribute of checkbox node
'timeIntervalId' = '18',
'centerId' = '7',
'type' = 'Z'
)
#post request that page makes using your filter selections e.g. zip code
r <- httr::POST(url = 'https://nc.211counts.org/dashBoard/barChart', body = data)
page <- read_html(r)
categories <- page %>% html_nodes(".categoriesDiv .toolTipSubCategory, #totalLabel") %>% html_text
colNodes <- page %>% html_nodes(".categoriesDiv .value")
percentages <- colNodes %>% html_attr('data-percentage')
counts <- colNodes %>% html_attr('data-value')
df <- as.data.frame(cbind(categories, percentages, counts))
print(df)
#Lookups e.g. zip codes. Taken from initial url
initial_page <- read_html('https://nc.211counts.org/')
ids <- initial_page %>% html_nodes('.zip [value]') %>% html_attr('value')
zips <- initial_page %>% html_nodes('.zip label') %>% html_text() %>% trimws()
print(ids[match('27006', zips)])