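I want to use R and the rvest package to get the HTML node of the datasets table on Kaggle, but I cannot reach that node.
I opened the developer tools in Chrome to get the node's XPath.
html_node does not seem to be able to reach nodes beyond a certain depth; it returns NA.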
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
base_url <- "https://www.kaggle.com/datasets?sortBy=hottest&group=public&page=1&pageSize=20&size=all&filetype=csv&license=all"
url_html <- read_html(base_url)
xpath_to_table <- "/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[1]/div[2]/div[2]/div/div/div[1]"
xpath_this_works <- "/html/body/div[1]/div[2]/div"
xpath_this_fails <- "/html/body/div[1]/div[2]/div/div"
url_html %>%
  html_node(xpath = xpath_to_table)
#> {xml_missing}
#> <NA>
# Should show many divs, one div per row in the table
url_html %>%
  html_node(xpath = xpath_this_works)
#> {xml_node}
#> <div data-component-name="DatasetList" style="display: flex; flex-direction: column; flex: 1 0 auto;">
url_html %>%
  html_node(xpath = xpath_this_fails)
#> {xml_missing}
#> <NA>
Created on 2019-03-29 by the reprex package (v0.2.1)
I expected html_node to return many divs, one div for each row of the dataset table.
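A possible way to narrow this down, sketched under assumptions: the DatasetList node printed above shows no children, which may mean the HTML returned by read_html contains only an empty container and the rows are added later by JavaScript in the browser. I have not confirmed that this is what Kaggle does. The two HTML strings below are hypothetical stand-ins, not Kaggle's real markup. The first part counts the children of the deepest node that can be reached; the second part shows that, even when the rows are present, html_node returns only the first match while html_nodes returns all of them.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the HTML sent by the server before any
# JavaScript runs: the DatasetList container exists but is empty.
static_page <- read_html('
<html><body>
  <div>
    <div><div data-component-name="DatasetList"></div></div>
  </div>
</body></html>')

# Deepest node that can be reached, and how many child elements it has.
container <- html_node(static_page, xpath = "/html/body/div[1]/div[1]/div")
n_children <- xml_length(container)

# One level deeper there is nothing to select, so the result is xml_missing.
deeper <- html_node(static_page, xpath = "/html/body/div[1]/div[1]/div/div")
deeper_missing <- inherits(deeper, "xml_missing")

# Hypothetical stand-in for the page after rendering: one div per table row.
rendered_page <- read_html('
<html><body>
  <div data-component-name="DatasetList">
    <div class="row">a</div><div class="row">b</div><div class="row">c</div>
  </div>
</body></html>')

rows_xpath <- "//div[@data-component-name='DatasetList']/div"

# html_node() returns only the first match; html_nodes() returns every match.
first_row <- html_node(rendered_page, xpath = rows_xpath)
all_rows <- html_nodes(rendered_page, xpath = rows_xpath)

print(n_children)
print(deeper_missing)
print(length(all_rows))
```

If the container really has zero children in the downloaded HTML, no XPath can go deeper, and the rows would have to come from somewhere else, for example a rendered copy of the page or whatever request the page makes to load its data.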