I am trying to scrape the last table, "MLB Detailed Standings", from https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml
My R code is as follows:
library(XML)
library(httr)
library(plyr)
library(stringr)
url <- paste0("http://www.baseball-reference.com/leagues/MLB/", 2015, "-standings.shtml")
tab <- GET(url)
data <- readHTMLTable(rawToChar(tab$content))
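However, it does not seem to pick up the table I want. Looking at the page source, it appears that the table is somehow commented out?
Any help would be great.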
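The suspicion about comments can be checked offline. Below is a minimal sketch, assuming the xml2 package and a made-up HTML snippet standing in for the real page (the `id` values are invented for illustration):

```r
library(xml2)

# Hypothetical stand-in for the page: one visible table, one inside a comment
html <- '<html><body>
<table id="visible"><tr><th>a</th></tr><tr><td>1</td></tr></table>
<!--
<table id="hidden"><tr><th>b</th></tr><tr><td>2</td></tr></table>
-->
</body></html>'

doc <- read_html(html)

# Tables the parser exposes as real nodes
visible_ids <- xml_attr(xml_find_all(doc, "//table"), "id")

# Comment nodes whose text contains table markup
comments <- xml_text(xml_find_all(doc, "//comment()"))
n_commented <- sum(grepl("<table", comments, fixed = TRUE))

visible_ids   # ids of the tables outside any comment
n_commented   # number of comments that hide a table
```

If the live page follows the same pattern, a `//table` query would miss the detailed standings while the comment nodes would still contain its markup.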
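Answer 0 (score: 0)
Based on the answer MrFlick linked: the table sits inside an HTML comment, so it has to be pulled out of the comment nodes and parsed separately: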
library(XML)
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml")
alt_tables <- xml2::xml_find_all(page, "//comment()") %>% {
  # Keep only the comment nodes that contain HTML table markup
  raw_parts <- as.character(.[grep("</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--", "-->"), c("", ""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i) {
    # xml2:: prefixes are explicit because xml2 may not be attached by rvest
    rvest::html_table(xml2::xml_find_all(xml2::read_html(i), "//table")) %>%
      .[[1]]
  })
}
# [[2]] picks the detailed standings; the index may change if the page layout changes
tbl <- alt_tables[[2]]
# as_tibble() replaces the deprecated as.tibble()
tbl <- as_tibble(tbl)
tbl
# A tibble: 31 x 23
      Rk Tm    Lg        G     W     L `W-L%`     R    RA Rdiff   SOS   SRS pythWL  Luck Inter Home  Road  ExInn
   <int> <chr> <chr> <int> <int> <int>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <int> <chr> <chr> <chr> <chr>
 1     1 STL   NL      162   100    62  0.617     4   3.2   0.8  -0.3   0.5 96-66      4 11-9  55-26 45-36 8-8
 2     2 PIT   NL      162    98    64  0.605   4.3   3.7   0.6  -0.3   0.3 93-69      5 13-7  53-28 45-36 12-9
 3     3 CHC   NL      162    97    65  0.599   4.3   3.8   0.5  -0.3   0.2 90-72      7 10-10 49-32 48-33 13-5
 4     4 KCR   AL      162    95    67  0.586   4.5     4   0.5   0.2   0.7 90-72      5 13-7  51-30 44-37 10-6
 5     5 TOR   AL      162    93    69  0.574   5.5   4.1   1.4   0.2   1.6 102-60    -9 12-8  53-28 40-41 8-6
 6     6 LAD   NL      162    92    70  0.568   4.1   3.7   0.4  -0.3   0.1 89-73      3 10-10 55-26 37-44 6-9
 7     7 NYM   NL      162    90    72  0.556   4.2   3.8   0.4  -0.4     0 89-73      1 9-11  49-32 41-40 9-6
 8     8 TEX   AL      162    88    74  0.543   4.6   4.5   0.1   0.2   0.4 83-79      5 11-9  43-38 45-36 5-4
 9     9 NYY   AL      162    87    75  0.537   4.7   4.3   0.4   0.3   0.8 88-74     -1 11-9  45-36 42-39 4-9
10    10 HOU   AL      162    86    76  0.531   4.5   3.8   0.7   0.2   0.9 93-69     -7 16-4  53-28 33-48 8-6
# ... with 21 more rows, and 5 more variables: `1Run` <chr>, vRHP <chr>, vLHP <chr>, `≥.500` <chr>, `<.500` <chr>
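The same idea can be tried without hitting the site. The sketch below is a condensed variation on the code above, not the linked answer itself: it reads the comment text with `xml2::xml_text()` instead of stripping the `<!--` and `-->` markers by hand. The helper name `read_commented_tables` and the tiny HTML snippet are made up for illustration.

```r
library(xml2)
library(rvest)

# Hypothetical helper: return every table found inside HTML comments
read_commented_tables <- function(doc) {
  # Text of each comment node, without the <!-- and --> markers
  comment_text <- xml_text(xml_find_all(doc, "//comment()"))
  # Keep only the comments that contain table markup
  with_tables <- comment_text[grepl("<table", comment_text, fixed = TRUE)]
  # Re-parse each comment body as HTML and convert its tables
  tables <- lapply(with_tables, function(fragment) {
    html_table(xml_find_all(read_html(fragment), "//table"))
  })
  # Flatten one level so the result is a plain list of tables
  unlist(tables, recursive = FALSE)
}

# Made-up page with a single commented-out table
html <- '<html><body>
<!--
<table><tr><th>Tm</th><th>W</th></tr><tr><td>AAA</td><td>10</td></tr><tr><td>BBB</td><td>8</td></tr></table>
-->
</body></html>'

tables <- read_commented_tables(read_html(html))
standings <- tables[[1]]
standings
```

On the real page the helper would presumably return several tables, so the right list index still has to be found by inspecting the result.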