Unable to extract image links using rvest

Time: 2019-07-31 05:02:15

Tags: r image web-scraping rvest

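I am unable to extract image links from a website.

I am new to web scraping. I have used SelectorGadget as well as the browser's Inspect Element tool to find the class of the images, but to no avail.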

library(rvest)

main.page <- read_html(x = "https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")
urls <- main.page %>%
  html_nodes(".match-detail--item:nth-child(9) .lazyloaded") %>%
  html_attr("src")

sotu <- data.frame(urls = urls)

I get the following output:

<0 rows> (or 0-length row.names)

2 Answers:

Answer 0 (score: 2)

For some reason, some classes and attributes do not show up in the scraped data. Just target img instead of .lazyloaded, and read data-src instead of src:

library(rvest)

main.page <- read_html("https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")

main.page %>% 
    html_nodes(".match-detail--item:nth-child(9) img") %>%
    html_attr("data-src")

#### OUTPUT ####

 [1] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/1.png&h=25&w=25"
 [2] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
 [3] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
 [4] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
 [5] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
 [6] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
 [7] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
 [8] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
 [9] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[10] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[11] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[12] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
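The difference between the two selectors can be reproduced offline. The sketch below is only an illustration: the markup, the `lazyload` class, and the example.com URLs are made up, on the assumption that the server sends images with the real URL in `data-src` and that the `lazyloaded` class and the `src` attribute only appear after JavaScript has run in a browser.

```r
library(rvest)

# Hypothetical markup imitating the HTML before any JavaScript has run:
# the URL sits in data-src and the "lazyloaded" class is not present yet.
raw_html <- paste0(
  '<div class="match-detail--item">',
  '<img class="lazyload" data-src="https://example.com/logo-1.png">',
  '<img class="lazyload" data-src="https://example.com/logo-2.png">',
  '</div>'
)

page <- read_html(raw_html)

# The class used in the question matches nothing in the raw markup.
by_class <- page %>% html_nodes(".lazyloaded") %>% html_attr("src")

# Selecting the img elements and reading data-src returns the URLs.
by_tag <- page %>% html_nodes("img") %>% html_attr("data-src")

# src is absent on these elements, so html_attr() yields NA for each one.
missing_src <- page %>% html_nodes("img") %>% html_attr("src")

length(by_class)
by_tag
missing_src
```

If a selector copied from the browser returns nothing, comparing it against the page source (rather than the live DOM shown by Inspect Element) is usually a quick way to see which classes and attributes are really there.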

Answer 1 (score: 0)

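Because the DOM is modified by JavaScript (using React) when the page is loaded in a browser, you do not get the same layout with rvest. A less ideal option is to use a regular expression to pull the information out of the JavaScript object that holds the links, and then use a JSON parser to extract the links.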

library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)

url <- "https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974"

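# Flatten the text content of <body> into a single string
r <- read_html(url) %>% 
  html_nodes('body') %>% 
  html_text() %>% 
  toString()

# Lazily capture the JSON array that follows debuts": (up to the first "]")
x <- str_match_all(r,'debuts":(.*?\\])')  
# Column 2 of the match matrix holds the captured group; parse it as JSON
json <- jsonlite::fromJSON(x[[1]][,2])
print(json$imgicon)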
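The regex-then-parse step can be tried without hitting the site. This is a minimal sketch on a made-up string: `page_text`, the `window.__DATA__` wrapper, the player names, and the example.com URLs are all hypothetical; only the `debuts` and `imgicon` field names come from the answer above.

```r
library(stringr)
library(jsonlite)

# Hypothetical page text with a "debuts" JSON array embedded in a script payload.
page_text <- paste0(
  'window.__DATA__ = {"match":{"debuts":[',
  '{"name":"Player A","imgicon":"https://example.com/a.png"},',
  '{"name":"Player B","imgicon":"https://example.com/b.png"}',
  '],"other":1}};'
)

# Lazily capture everything after debuts": up to and including the first "]".
m <- str_match(page_text, 'debuts":(.*?\\])')

# Column 2 of the match matrix is the captured group, i.e. the JSON array.
debuts <- fromJSON(m[, 2])

# fromJSON() turns an array of objects into a data frame, one column per field.
debuts$imgicon
```

Because the match stops at the first closing bracket, this pattern only works while the captured array contains no nested arrays, which is part of why the approach is less than ideal.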