Getting nodes from an HTML page to scrape URLs with R

Time: 2020-02-11 09:23:50

Tags: html r rvest

https://i.stack.imgur.com/xeczg.png

I am trying to get the URLs under the '.2lines' nodes of the web page 'https://www.sgcarmart.com/main/index.php':
library(rvest)
url <- read_html('https://www.sgcarmart.com/main/index.php') %>% html_nodes('.2lines') %>% html_attr()

I get an error from the html_nodes function:

Error in parse_simple_selector(stream) : 
  Expected selector, got <NUMBER '.2' at 1>

How can I fix this error?

1 Answer:

Answer 0: (score: 0)

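You can use an XPath selector to find the nodes you want. The CSS selector '.2lines' fails because a CSS class selector cannot start with a digit, so the parser reads '.2' as a number. The links are actually contained in <a> tags inside the <p> tags that you are trying to reference by class. You can access them in a single XPath expression: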

library(rvest)

site <- 'https://www.sgcarmart.com'

urls <-  site                                           %>%
         paste0("/main/index.php")                      %>%
         read_html()                                    %>% 
         html_nodes(xpath = "//*[@class = '2lines']/a") %>% 
         html_attr("href")                              %>%
         {paste0(site, .)}

urls
#>  [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12485"
#>  [2] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11875"
#>  [3] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11531"
#>  [4] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11579"
#>  [5] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12635"
#>  [6] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12507"
#>  [7] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12644"
#>  [8] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12622"
#>  [9] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12650"
#> [10] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12651"
#> [11] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12589"
#> [12] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12649"
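
If you would rather stay with CSS selectors, one possible alternative is an attribute selector, which takes the class name as a quoted string and so avoids the digit-first restriction. The sketch below is only an illustration: the inline HTML is a hypothetical stand-in for the real page (so it runs offline), and it assumes the links are direct <a> children of the '2lines' elements, as the XPath above suggests.

```r
library(rvest)

# Hypothetical stand-in for the real page, so the sketch runs offline.
page <- read_html('
  <div>
    <p class="2lines"><a href="/new_cars/newcars_overview.php?CarCode=1">A</a></p>
    <p class="2lines promo"><a href="/new_cars/newcars_overview.php?CarCode=2">B</a></p>
    <p class="other"><a href="/ignored.php">C</a></p>
  </div>')

# [class~='2lines'] matches any element whose class list contains the token
# "2lines"; quoting the value sidesteps the digit-first identifier problem.
hrefs <- page %>%
  html_nodes("[class~='2lines'] > a") %>%
  html_attr("href")

# Resolve the relative links against the site root.
urls <- xml2::url_absolute(hrefs, "https://www.sgcarmart.com/")
urls
```

Note that the XPath test @class = '2lines' only matches when the class attribute is exactly '2lines', whereas the ~= form also matches elements that carry additional classes, such as the second <p> in the sample.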