Web scraping in R: a commented-out table

Time: 2018-12-12 20:54:16

Tags: r web-scraping

I am trying to scrape the last table on https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml, namely "MLB Detailed Standings".

My R code is as follows:

library(XML)
library(httr)
library(plyr)
library(stringr)

url <- paste0("http://www.baseball-reference.com/leagues/MLB/", 2015, "-standings.shtml")
tab <- GET(url)
data <- readHTMLTable(rawToChar(tab$content))

However, it does not seem to pick up the table I want. Looking at the page source, it seems the table is somehow commented out?

Any help would be appreciated.

1 answer:

Answer 0 (score: 0):

From the answer linked by MrFlick:

# The XML package is not needed here; xml2, rvest and stringi do the work
library(tidyverse)
library(rvest)

page <- xml2::read_html("https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml")

alt_tables <- xml2::xml_find_all(page,"//comment()") %>% {
  # Keep only the comment nodes whose text contains HTML table markup
  raw_parts <- as.character(.[grep("</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--","-->"),c("",""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and 
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i){
    rvest::html_table(xml2::xml_find_all(xml2::read_html(i), "//table")) %>% 
      .[[1]]
  })
}


tbl <- alt_tables[[2]]
tbl <- as_tibble(tbl)
tbl

# A tibble: 31 x 23
      Rk Tm    Lg        G     W     L `W-L%`     R    RA Rdiff   SOS   SRS pythWL  Luck Inter Home  Road  ExInn
   <int> <chr> <chr> <int> <int> <int>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <int> <chr> <chr> <chr> <chr>
 1     1 STL   NL      162   100    62  0.617   4     3.2   0.8  -0.3   0.5 96-66      4 11-9  55-26 45-36 8-8  
 2     2 PIT   NL      162    98    64  0.605   4.3   3.7   0.6  -0.3   0.3 93-69      5 13-7  53-28 45-36 12-9 
 3     3 CHC   NL      162    97    65  0.599   4.3   3.8   0.5  -0.3   0.2 90-72      7 10-10 49-32 48-33 13-5 
 4     4 KCR   AL      162    95    67  0.586   4.5   4     0.5   0.2   0.7 90-72      5 13-7  51-30 44-37 10-6 
 5     5 TOR   AL      162    93    69  0.574   5.5   4.1   1.4   0.2   1.6 102-60    -9 12-8  53-28 40-41 8-6  
 6     6 LAD   NL      162    92    70  0.568   4.1   3.7   0.4  -0.3   0.1 89-73      3 10-10 55-26 37-44 6-9  
 7     7 NYM   NL      162    90    72  0.556   4.2   3.8   0.4  -0.4   0   89-73      1 9-11  49-32 41-40 9-6  
 8     8 TEX   AL      162    88    74  0.543   4.6   4.5   0.1   0.2   0.4 83-79      5 11-9  43-38 45-36 5-4  
 9     9 NYY   AL      162    87    75  0.537   4.7   4.3   0.4   0.3   0.8 88-74     -1 11-9  45-36 42-39 4-9  
10    10 HOU   AL      162    86    76  0.531   4.5   3.8   0.7   0.2   0.9 93-69     -7 16-4  53-28 33-48 8-6  
# ... with 21 more rows, and 5 more variables: `1Run` <chr>, vRHP <chr>, vLHP <chr>, `≥.500` <chr>, `<.500` <chr>
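
To see the technique in isolation, here is a minimal offline sketch. The HTML string is a toy page made up for illustration (it is not the real baseball-reference markup), and the ids "visible" and "hidden" are hypothetical. It shows that ordinary parsing skips a table wrapped in an HTML comment, and that re-parsing the comment text as HTML recovers it:

```r
library(xml2)
library(rvest)

# Toy page (made up for this sketch): one visible table and one table
# wrapped in an HTML comment, similar in spirit to the page above.
html <- '<html><body>
<table id="visible"><tr><th>Tm</th><th>W</th></tr><tr><td>STL</td><td>100</td></tr></table>
<!--
<table id="hidden"><tr><th>Tm</th><th>L</th></tr><tr><td>STL</td><td>62</td></tr></table>
-->
</body></html>'

page <- read_html(html)

# Ordinary parsing only finds the table that is not commented out
visible <- html_table(page)

# xml_text() on a comment node returns its content without <!-- and -->
comment_text <- xml_text(xml_find_all(page, "//comment()"))

# Keep only the comments that actually contain table markup
with_tables <- comment_text[grepl("<table", comment_text, fixed = TRUE)]

# Re-parse each comment body as HTML and take the first table in it
hidden <- lapply(with_tables, function(txt) html_table(read_html(txt))[[1]])

hidden[[1]]
```

Using xml_text() avoids stripping the comment delimiters by hand. On the real page, picking the table by position (as in alt_tables[[2]]) depends on the order of the comments, so it is worth checking the column names of the result before relying on it.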