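While trying to get familiar with rvest and grab baseball standings, @Cory pointed me to a site that has one table per division. (In baseball, 2 leagues x 3 divisions = 6 tables.)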
library("rvest"); library("xml2")
read_html("http://sports.yahoo.com/mlb/standings/") %>%
html_nodes(".yui3-tabview-content") %>%
html_nodes("table") %>% html_table -> standings
But the tables do not include columns for league and division. That information lives in the <h4> and <h5> headings above each table.
read_html("http://sports.yahoo.com/mlb/standings/") %>%
html_nodes(".yui3-tabview-content") %>%
html_nodes("h4") %>% html_text -> leagues
leagues # [1] "American League" "National League"
read_html("http://sports.yahoo.com/mlb/standings/") %>%
html_nodes(".yui3-tabview-content") %>%
html_nodes("h5") %>% html_text -> divs
divs # [1] "East" "Central" "West" "East" "Central" "West"
I know I can assign league and division semi-manually:
for (i in 1:6){
standings[[i]]$League <- as.factor( leagues[ceiling(i/3)])
standings[[i]]$Division <- as.factor(divs[i])
}
standings <- do.call(rbind, standings) # desired output
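The `ceiling(i/3)` index arithmetic in the loop above can also be written without an explicit counter. The following is only a sketch under stated assumptions: the `standings` list here is a hypothetical stand-in built from toy data frames (not the scraped tables), and it assumes every league has the same number of divisions.

```r
# Header values as scraped in the question
leagues <- c("American League", "National League")
divs <- c("East", "Central", "West", "East", "Central", "West")

# Hypothetical stand-in for the six scraped tables
standings <- lapply(1:6, function(i) {
  data.frame(Team = paste("Team", i), stringsAsFactors = FALSE)
})

# Repeat each league once per division it contains (assumes equal-sized leagues)
league_of_table <- rep(leagues, each = length(divs) / length(leagues))

# Attach both labels to each table, using shared factor levels across tables
standings <- Map(function(df, lg, dv) {
  df$League <- factor(lg, levels = leagues)
  df$Division <- factor(dv, levels = unique(divs))
  df
}, standings, league_of_table, divs)

combined <- do.call(rbind, standings)
```

Setting the factor levels explicitly means every table carries the same level set before `rbind` combines them.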
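I am fine with assigning these manually, since I doubt this structure will change... but it got me thinking: is there a clever way to have each table inherit (or look back at) the most recent values of <h4> and <h5> and store them as columns?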
TYVM
Answer 0 (score: 1)
If you look at the xml_children of the node we are working with, the headings are not "parents" of the tables; they sit next to them as siblings:
library("rvest"); library("xml2")  # xml_children() comes from xml2
a <- read_html("http://sports.yahoo.com/mlb/standings/") %>%
html_nodes(".yui3-tabview-content") %>%
xml_children
> a
{xml_nodeset (14)}
[1] <h4 class="american-league">American League</h4>
[2] <h5 class="yom-sports-flavor-full full">East</h5>
[3] <table summary="East" class="yom-data yom-data-small yom-sports-flavor-fu ...
[4] <h5 class="yom-sports-flavor-full full">Central</h5>
[5] <table summary="Central" class="yom-data yom-data-small yom-sports-flavor ...
[6] <h5 class="yom-sports-flavor-full full">West</h5>
[7] <table summary="West" class="yom-data yom-data-small yom-sports-flavor-fu ...
[8] <h4 class="national-league">National League</h4>
[9] <h5 class="yom-sports-flavor-full full">East</h5>
[10] <table summary="East" class="yom-data yom-data-small yom-sports-flavor-fu ...
[11] <h5 class="yom-sports-flavor-full full">Central</h5>
[12] <table summary="Central" class="yom-data yom-data-small yom-sports-flavor ...
[13] <h5 class="yom-sports-flavor-full full">West</h5>
[14] <table summary="West" class="yom-data yom-data-small yom-sports-flavor-fu ...
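Because the headings are siblings that come before each table, one way to "look back" is the XPath `preceding-sibling` axis. This is a minimal sketch, not a tested scraper for the real page: the HTML string and the `.content` class are made-up stand-ins that mimic the layout printed above.

```r
library(rvest)
library(xml2)

# Toy stand-in for the standings page (hypothetical markup, not the real site)
html <- '<div class="content">
  <h4>American League</h4>
  <h5>East</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>Tampa Bay</td><td>36</td></tr></table>
  <h5>Central</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>Kansas City</td><td>36</td></tr></table>
  <h4>National League</h4>
  <h5>East</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>NY Mets</td><td>36</td></tr></table>
</div>'
page <- read_html(html)
tables <- html_nodes(page, ".content table")

# The table's parent is the container, not a heading
parent_tag <- xml_name(xml_parent(tables[[1]]))

# preceding-sibling is a reverse axis, so [1] is the closest heading before each table
division <- xml_text(xml_find_first(tables, "preceding-sibling::h5[1]"))
league <- xml_text(xml_find_first(tables, "preceding-sibling::h4[1]"))
```

`division` and `league` then line up one-to-one with `tables`, so they can be attached as columns without hard-coding positions.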
So looping through that structure should keep everything in order, switching from AL to NL at the right point, or something like that...
data <- list()
for(i in 1:6){
if(i < 4){
title <- paste(a[[1]] %>% html_text, a[[i*2]] %>% html_text)
data[[title]] <- a[[i*2+1]] %>% html_table
}else{
title <- paste(a[[8]] %>% html_text, a[[i*2+1]] %>% html_text)
data[[title]] <- a[[i*2+2]] %>% html_table
}
}
> data
$`American League East`
Team W L Pct GB Home Away Streak RS RA Diff L10
1 Tampa Bay 36 30 0.545 -- 19-19 17-11 L-1 246 239 7 6-4
2 NY Yankees 34 30 0.531 1.0 16-11 18-19 L-2 290 278 12 5-5
3 Baltimore 33 31 0.516 2.0 22-13 11-18 W-2 288 253 35 8-2
4 Toronto 34 32 0.515 2.0 20-12 14-20 L-2 361 292 69 8-2
5 Boston 28 38 0.424 8.0 16-18 12-20 W-1 258 315 -57 3-7
$`American League Central`
Team W L Pct GB Home Away Streak RS RA Diff L10
1 Kansas City 36 25 0.590 -- 19-11 17-14 W-2 266 213 53 6-4
2 Minnesota 34 30 0.531 3.5 20-12 14-18 L-2 267 267 0 2-8
3 Detroit 34 31 0.523 4.0 18-18 16-13 L-1 272 265 7 6-4
4 Cleveland 30 33 0.476 7.0 12-18 18-15 W-1 265 269 -4 4-6
5 Chi White Sox 28 35 0.444 9.0 16-12 12-23 L-5 221 288 -67 3-7
$`American League West`
Team W L Pct GB Home Away Streak RS RA Diff L10
1 Houston 38 28 0.576 -- 23-14 15-14 W-3 287 254 33 4-6
2 Texas 35 30 0.538 2.5 15-16 20-14 W-2 296 281 15 6-4
3 LA Angels 33 32 0.508 4.5 19-15 14-17 W-1 255 258 -3 5-5
4 Seattle 29 36 0.446 8.5 13-19 16-17 L-1 223 268 -45 5-5
5 Oakland 28 39 0.418 10.5 11-18 17-21 W-3 288 264 24 5-5
$`National League East`
Team W L Pct GB Home Away Streak RS RA Diff L10
1 NY Mets 36 30 0.545 -- 26-11 10-19 W-3 254 252 2 6-4
2 Washington 34 31 0.523 1.5 16-12 18-19 W-1 289 280 9 4-6
3 Atlanta 31 34 0.477 4.5 15-14 16-20 L-1 278 295 -17 4-6
4 Miami 29 37 0.439 7.0 17-17 12-20 W-2 259 270 -11 6-4
5 Philadelphia 22 44 0.333 14.0 15-16 7-28 L-8 200 311 -111 1-9
$`National League Central`
Team W L Pct GB Home Away Streak RS RA Diff L10
1 St. Louis 43 21 0.672 -- 26-7 17-14 W-5 253 183 70 7-3
2 Pittsburgh 37 27 0.578 6.0 21-11 16-16 W-6 264 205 59 7-3
3 Chi Cubs 34 28 0.548 8.0 18-13 16-15 L-1 250 250 0 6-4
4 Cincinnati 29 35 0.453 14.0 17-13 12-22 W-1 257 277 -20 6-4
5 Milwaukee 24 42 0.364 20.0 11-24 13-18 L-4 241 313 -72 4-6
$`National League West`
Team W L Pct GB Home Away Streak RS RA Diff L10
1 LA Dodgers 37 28 0.569 -- 25-10 12-18 L-2 281 225 56 6-4
2 San Francisco 35 31 0.530 2.5 17-18 18-13 W-1 265 259 6 4-6
3 Arizona 31 33 0.484 5.5 15-16 16-17 L-1 291 285 6 5-5
4 San Diego 32 35 0.478 6.0 16-19 16-16 L-3 284 299 -15 3-7
5 Colorado 28 36 0.438 8.5 13-18 15-18 L-2 277 318 -41 3-7
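The loop above hard-codes the positions of the 14 children. A more general variant carries the most recent heading forward with `cumsum()`. Treat it as a sketch under assumptions: the HTML below is a made-up miniature of the page, `.content` is a hypothetical class name, and every table is assumed to be preceded by at least one <h4> and one <h5>.

```r
library(rvest)
library(xml2)

# Toy stand-in for the standings page (hypothetical markup, not the real site)
html <- '<div class="content">
  <h4>American League</h4>
  <h5>East</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>Tampa Bay</td><td>36</td></tr></table>
  <h5>Central</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>Kansas City</td><td>36</td></tr></table>
  <h4>National League</h4>
  <h5>East</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>NY Mets</td><td>36</td></tr></table>
</div>'
page <- read_html(html)

# Element children of the container, in document order
kids <- xml_children(html_nodes(page, ".content")[[1]])
tag <- xml_name(kids)
txt <- xml_text(kids)

# Running counts tell us which heading was seen most recently at each position
league_id <- cumsum(tag == "h4")
div_id <- cumsum(tag == "h5")
is_table <- tag == "table"

leagues <- txt[tag == "h4"]
divs <- txt[tag == "h5"]
tabs <- html_table(kids[is_table])

# Attach the carried-forward headings to each table, then stack the tables
standings <- do.call(rbind, Map(function(df, lg, dv) {
  df$League <- lg
  df$Division <- dv
  df
}, tabs, leagues[league_id[is_table]], divs[div_id[is_table]]))
```

The same code should work unchanged if a league gains or loses a division, since nothing depends on there being exactly three tables per league.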