R - Scraping - rvest error: Subscript out of bounds

Time: 2018-09-09 16:33:11

Tags: r screen-scraping rvest

I am trying to loop through a vector of horse numbers, pasting each one onto the base URL after (horseno=). However, much of the time I either get a "subscript out of bounds" error or get character(0) back.

library(rvest)
library(tidyverse)

horsenumber <- c("S385", "T436", "B016", "V102", "B121", "A370", "V026", "V107", "V086", "A082", "T267", "B059", "T118", "V077", "S393", "T230", "A061", "B387", "T370", "B165", "B326",
                 "B317", "B159", "B353", "T029", "T233", "A357", "A334", "A235", "T412", "V074", "B133", "T022", "A195", "T253", "A233", "V338", "B182", "A071", "V407", "B197", "B421",
                 "A427", "T282", "A359", "A069", "A097", "A351", "S397", "A305", "T112", "V334", "S204", "P421", "S277", "B141", "A333", "T380", "A005", "A189", "A314", "V381", "S420",
                 "A419", "V243", "A284", "S388", "A125", "B370", "A408", "A057", "A086", "B242", "A424", "B292", "T388", "V072", "V250", "A177", "T134", "A067", "A074", "A417", "B265",
                 "B170", "T419", "T389", "B080", "B300", "V336", "B119", "B204", "B144", "B260", "B350", "B056", "A150", "B209", "T200", "B149", "B249", "T349")

data <- lapply(paste0('http://racing.hkjc.com/racing/information/english/horse/horse.aspx?horseno=', horsenumber),
                function(url){
                      # Download and parse the page once, then reuse it for every field
                      page <- read_html(url)
                      # Helper: text of all nodes matching a CSS selector
                      get_text <- function(css) page %>% html_nodes(css) %>% html_text()

                      horsename <- get_text(".title_text")
                      age <- get_text("td tr:nth-child(1) td:nth-child(2) span")
                      sex <- get_text("tr:nth-child(2) td:nth-child(2) span")
                      rhistory <- get_text("tr:nth-child(6) td:nth-child(2) span.table_eng_text")
                      r10day <- get_text("tr:nth-child(7) td:nth-child(2) span.table_eng_text")
                      rating <- get_text("tr:nth-child(3) td:nth-child(4) .table_eng_text")

                      # Stack the fields into one matrix per horse
                      rbind(horsename,age,sex,rhistory,r10day,rating)
                    })
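
A note on the character(0) results: html_text() returns a zero-length character vector when a selector matches no nodes, which may be what happens on pages whose layout differs from the one the selectors were written against. Below is a minimal base-R sketch of one way to guard against that. The helper first_or_na() is hypothetical (it is not part of rvest), and the two input vectors are made-up stand-ins for real html_text() output.

```r
# Hypothetical helper (not part of rvest): return the first match,
# or NA when the selector matched nothing.
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else as.character(x[[1]])
}

# Made-up stand-ins for html_text() results:
# one selector that matched, and one that matched nothing.
found   <- c("example horse name")
missing <- character(0)

# Every field now has length 1, so the fields line up when combined.
fields <- c(horsename = first_or_na(found), age = first_or_na(missing))
print(fields)
```

Wrapping each html_text() call this way keeps one value per field for every horse, so missing fields show up as NA rather than silently disappearing.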

In addition, I tried the following to scrape that particular table and turn it into a data frame for data mining. However, I also received: Error in .[[6]] : subscript out of bounds.

horse_info <- page %>%
  html_nodes('table') %>%
  .[[6]] %>%
  html_table(fill=TRUE)
horse_info
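
For context, "subscript out of bounds" is what R reports when [[ is asked for an element past the end of a list, so the error would be consistent with html_nodes('table') returning fewer than six tables for that page. The sketch below illustrates this with a plain list standing in for the node set. The helper nth_or_null() is hypothetical, and the three-element list is an assumption made only for illustration.

```r
# Plain list standing in for the result of html_nodes(page, "table");
# assume, for illustration, that the page only has three tables.
tables <- list("t1", "t2", "t3")

# Asking for a sixth element fails; capture the error message instead of stopping.
msg <- tryCatch(tables[[6]], error = function(e) conditionMessage(e))
print(msg)

# Hypothetical helper: return the i-th element only if it exists, else NULL.
nth_or_null <- function(x, i) {
  if (length(x) >= i) x[[i]] else NULL
}

second <- nth_or_null(tables, 2)  # exists
sixth  <- nth_or_null(tables, 6)  # does not exist, so NULL
print(length(tables))
```

Checking length() on the node set before indexing shows how many tables were actually found on a given page, which helps decide whether 6 is the right index.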

Any help would be much appreciated.

0 answers:

No answers