Web scraping with rvest - How to capture the full `href` URL from a shortened URL

Asked: 2019-03-19 14:45:00

Tags: r web-scraping rvest

I am trying to scrape data from a web page that contains a table and links. I can successfully download the table with the link text "score". However, instead of the shortened URL, I would like to capture the full `href` URL.

However, rvest only gives me the shortened (relative) URL. I don't know how to get the full URL, which I could then loop over as below to get the desired data and convert everything into a data frame.

    library(rvest)

    # Load the page
    odi_score_url <- read_html('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2019;type=year')


    urls <- odi_score_url %>% 
        html_nodes('td:nth-child(7) .data-link') %>%
        html_attr("href")

    links <- odi_score_url  %>% 
        html_nodes('td:nth-child(7) .data-link') %>%
        html_text()

    # Combine `links` and `urls` into a data.frame
    score_df <- data.frame(links = links, urls = urls, stringsAsFactors = FALSE)
    head(score_df)
           links                          urls
    1 ODI # 4074 /ci/engine/match/1153840.html
    2 ODI # 4075 /ci/engine/match/1153841.html
    3 ODI # 4076 /ci/engine/match/1153842.html
    4 ODI # 4077 /ci/engine/match/1144997.html
    5 ODI # 4078 /ci/engine/match/1144998.html
    6 ODI # 4079 /ci/engine/match/1144999.html

Then loop over each row in `score_df` to get the required data:

    for(i in score_df) {
        text <- read_html(score_df$urls[i]) %>% # load the page
            html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 , 
                   .stadium-details+ .match-detail--item span , .stadium-details , 
                   .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
            html_text() # get the text
        ## Create the dataframe

    }

I would appreciate your help!!!

Thanks in advance

1 Answer:

Answer 0 (score: 0)

The URLs are relative to the main page, so you get the full URLs by prepending `http://stats.espncricinfo.com/` to each link. For example:

urls <- odi_score_url %>% 
  html_nodes('td:nth-child(7) .data-link') %>%
  html_attr("href") %>% 
  paste0("http://stats.espncricinfo.com/", .)
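As an alternative to building the URL by hand with `paste0()` (which here yields a harmless double slash, since the hrefs already start with `/`), the xml2 package that rvest is built on provides `url_absolute()` for resolving relative links against a base URL. A minimal sketch, using hrefs copied from the output above:

```r
library(xml2)

# Relative hrefs as returned by html_attr("href") (values taken from the
# score_df output shown earlier)
rel <- c("/ci/engine/match/1153840.html", "/ci/engine/match/1153841.html")

# Resolve them against the page they were scraped from
full <- url_absolute(rel, "http://stats.espncricinfo.com/ci/engine/records/team/match_results.html")
# full[1] is "http://stats.espncricinfo.com/ci/engine/match/1153840.html"
```

This follows the usual URL-resolution rules, so it also handles hrefs that are relative to the current directory rather than the site root.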

Then you can write the loop as:

text_list <- list()
for(i in seq_along(score_df$urls)) {
  text_list[[i]] <- read_html(score_df$urls[i]) %>% # load the page
    html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 , 
                   .stadium-details+ .match-detail--item span , .stadium-details , 
                   .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
    html_text() # get the text
  # give some nice status
  cat("Scraping link", i, "\n")
}

Or, even better, as an `lapply` loop:

text_list <- lapply(score_df$urls, function(x) {
  # give some nice status
  cat("Scraping link", x, "\n")
  text <- read_html(x) %>% # load the page
    html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 , 
                   .stadium-details+ .match-detail--item span , .stadium-details , 
                   .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
    html_text()
  # return a data frame per URL (this must be the last expression,
  # otherwise the function returns NULL)
  data.frame(url = x, text = text, stringsAsFactors = FALSE)
})

Then we can use dplyr to convert this to a data.frame:

text_df <- dplyr::bind_rows(text_list)
head(text_df)
                                                          url           text
1 http://stats.espncricinfo.com//ci/engine/match/1153840.html    New Zealand
2 http://stats.espncricinfo.com//ci/engine/match/1153840.html          371/7
3 http://stats.espncricinfo.com//ci/engine/match/1153840.html      Sri Lanka
4 http://stats.espncricinfo.com//ci/engine/match/1153840.html 326 (49/50 ov)
5 http://stats.espncricinfo.com//ci/engine/match/1153840.html    New Zealand
6 http://stats.espncricinfo.com//ci/engine/match/1153840.html          371/7

Not sure if this is already what you want. Maybe you want to collapse the text so there is only one row per URL, but I think that should be easy enough to figure out if you need it.
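For completeness, the per-URL collapse mentioned above can be done with a `group_by()`/`summarise()` on `text_df`. A sketch using a small stand-in data frame (the values are illustrative, not real scraped output):

```r
library(dplyr)

# Toy stand-in for text_df: several text rows per url
text_df <- data.frame(
  url  = c("a", "a", "b"),
  text = c("New Zealand", "371/7", "Sri Lanka"),
  stringsAsFactors = FALSE
)

# One row per url, with the text fragments joined into a single string
collapsed <- text_df %>%
  group_by(url) %>%
  summarise(text = paste(text, collapse = " | "))
# collapsed has 2 rows; for url "a" the text is "New Zealand | 371/7"
```

The separator `" | "` is arbitrary; pick whatever makes the collapsed field easiest to split back apart later.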