I am trying to scrape this website using the rvest package in R. I have done this successfully with several other sites, but this one doesn't seem to work and I can't figure out why.
I copied the XPath from Chrome's inspect tool, but when I specify it in my rvest script it says the node doesn't exist. Could it have something to do with the table being generated dynamically rather than being static?
Thanks for the help!
library(rvest)
library(tidyverse)
library(stringr)
library(readr)
a <- read_html("http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201")
a <- html_node(a, xpath = "//*[@id='indicator10']")
a <- html_table(a)
a
Answer (score: 0)
As for your question: yes, because the table is generated dynamically you can't get it with rvest alone. In cases like this it's best to use the RSelenium library:
#Loading libraries
library(rvest) # to read the html
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the website
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# Specify the URL of the website to be scraped
url <- "http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201"
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the element you are looking for
a <- html_node(html_obj, xpath = "//*[@id='indicator10']")
I guess you are trying to get the first table. In that case it's better to use html_table to extract it:
# get the table with the indicator10 id
indicator10_table <- html_node(html_obj, "#indicator10 table") %>% html_table()
This time I used a CSS selector instead of an XPath expression.
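If you would rather keep using XPath, a roughly equivalent call would be the following sketch (it assumes the table sits inside the element with id indicator10, just as the CSS selector above does):
# same table, selected via XPath instead of CSS
indicator10_table <- html_node(html_obj, xpath = "//*[@id='indicator10']//table") %>% html_table()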
Hope that helps! Happy scraping!
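One last housekeeping note (a minimal sketch, not part of the workflow above): once the page source has been stored in html_obj you no longer need the browser, so you can close the session. The Selenium server started with shell() runs as a separate Java process and may need to be stopped manually.
# close the Chrome session that remDr$open() started
remDr$close()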