Question

I want to use R and the rvest package to get the HTML node for the dataset table on Kaggle, but I cannot reach that node.

The URL is https://www.kaggle.com/datasets?sortBy=hottest&group=public&page=1&pageSize=20&size=all&filetype=csv&license=all

I opened Developer Tools in Chrome to copy the node's XPath.

html_node seems unable to reach nodes below a certain depth; beyond that depth it returns NA.

library(tidyverse)
library(rvest)
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

base_url <- "https://www.kaggle.com/datasets?sortBy=hottest&group=public&page=1&pageSize=20&size=all&filetype=csv&license=all"

url_html <- read_html(base_url)
xpath_to_table <- "/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[1]/div[2]/div[2]/div/div/div[1]"
xpath_this_works <- "/html/body/div[1]/div[2]/div"
xpath_this_fails <- "/html/body/div[1]/div[2]/div/div"

url_html %>% 
  html_node(xpath=xpath_to_table)
#> {xml_missing}
#> <NA>
# Should show many divs, one div per one row in table

url_html %>%
  html_node(xpath=xpath_this_works)
#> {xml_node}
#> <div data-component-name="DatasetList" style="display: flex; flex-direction: column; flex: 1 0 auto;">

url_html %>%
  html_node(xpath=xpath_this_fails)
#> {xml_missing}
#> <NA>

Created on 2019-03-29 by the reprex package (v0.2.1)

I expect html_node to return many divs, one div for each row in the dataset table.
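To make the expectation concrete, here is a minimal sketch on a hypothetical inline HTML string (not the Kaggle page; the `id` and `class` values are made up for illustration). It shows that `html_node` returns only the first match, `html_nodes` returns every match, and an XPath with no match gives `xml_missing`. One possible reason for the missing nodes on the real page, which I have not verified, is that the rows are rendered by JavaScript in the browser and are therefore absent from the HTML that `read_html` downloads.

```r
library(rvest)

# Hypothetical static markup standing in for a table whose rows are <div>s.
page <- read_html(
  '<html><body><div id="list">
     <div class="row">a</div>
     <div class="row">b</div>
     <div class="row">c</div>
   </div></body></html>'
)

# html_node() returns at most one node: the first match.
first_row <- html_node(page, xpath = "//div[@id='list']/div")

# html_nodes() returns every match as a node set.
all_rows <- html_nodes(page, xpath = "//div[@id='list']/div")
row_text <- html_text(all_rows)

# An XPath that matches nothing in the parsed document yields xml_missing,
# the same result as in the reprex above.
no_match <- html_node(page, xpath = "//div[@id='list']/div/div")
```

If the same XPath finds nothing in the output of `read_html(base_url)` while Chrome shows the nodes, comparing the downloaded HTML with the browser's rendered DOM would be a reasonable next check.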

html_node fails to get a node in a Kaggle web page
