rvest: converting section headings into table columns

Time: 2015-06-16 13:57:47

Tags: r web-scraping rvest

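While trying to get familiar with rvest by scraping baseball standings, @Cory pointed me to a site that has one table per division. (In baseball, 2 leagues x 3 divisions = 6 tables.)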

  library("rvest"); library("xml2")
  read_html("http://sports.yahoo.com/mlb/standings/") %>% 
    html_nodes(".yui3-tabview-content") %>% 
    html_nodes("table") %>% html_table -> standings

But these tables don't include columns for league and division. That information is in the <h4> and <h5> headings above the tables:

read_html("http://sports.yahoo.com/mlb/standings/") %>% 
  html_nodes(".yui3-tabview-content") %>%
  html_nodes("h4") %>% html_text -> leagues
  leagues # [1] "American League" "National League"

read_html("http://sports.yahoo.com/mlb/standings/") %>% 
  html_nodes(".yui3-tabview-content") %>%
  html_nodes("h5") %>% html_text -> divs 
  divs # [1] "East"    "Central" "West"    "East"    "Central" "West"

I know I can assign the league and division semi-manually:

for (i in 1:6){
  standings[[i]]$League <- as.factor( leagues[ceiling(i/3)])
  standings[[i]]$Division <- as.factor(divs[i]) 
}
standings <- do.call(rbind, standings) # desired output

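I'm fine with the manual assignment, since I doubt this structure will change... but it got me thinking: is there a clever way to have each table inherit/look back to the most recent values of <h5> and <h4> and store them as columns?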

TYVM

1 answer:

Answer 0 (score: 1)

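If you look at the xml_children of the node we're working with, the headings are not "parents" of the tables; they are siblings of the tables under the same container: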

library("rvest"); library("xml2")  # xml_children() comes from xml2
a <- read_html("http://sports.yahoo.com/mlb/standings/") %>% 
       html_nodes(".yui3-tabview-content") %>% 
       xml_children

> a
{xml_nodeset (14)}
 [1] <h4 class="american-league">American League</h4>
 [2] <h5 class="yom-sports-flavor-full full">East</h5>
 [3] <table summary="East" class="yom-data yom-data-small yom-sports-flavor-fu ...
 [4] <h5 class="yom-sports-flavor-full full">Central</h5>
 [5] <table summary="Central" class="yom-data yom-data-small yom-sports-flavor ...
 [6] <h5 class="yom-sports-flavor-full full">West</h5>
 [7] <table summary="West" class="yom-data yom-data-small yom-sports-flavor-fu ...
 [8] <h4 class="national-league">National League</h4>
 [9] <h5 class="yom-sports-flavor-full full">East</h5>
[10] <table summary="East" class="yom-data yom-data-small yom-sports-flavor-fu ...
[11] <h5 class="yom-sports-flavor-full full">Central</h5>
[12] <table summary="Central" class="yom-data yom-data-small yom-sports-flavor ...
[13] <h5 class="yom-sports-flavor-full full">West</h5>
[14] <table summary="West" class="yom-data yom-data-small yom-sports-flavor-fu ...
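
The sibling relationship can be checked without relying on the live page. The following is a minimal sketch on a made-up HTML fragment (the `content` class and the table rows are assumptions for illustration, not the real Yahoo markup):

```r
library(xml2)

# Hypothetical miniature of the page layout: headings and tables
# all sit side by side inside one container <div>.
doc <- read_html('
<div class="content">
  <h4>American League</h4>
  <h5>East</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>Tampa Bay</td><td>36</td></tr></table>
  <h5>Central</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>Kansas City</td><td>36</td></tr></table>
</div>')

kids <- xml_children(xml_find_first(doc, "//div[@class='content']"))

# Tag names of the children, in document order
tags <- xml_name(kids)
tags  # "h4" "h5" "table" "h5" "table"

# The parent of the first table is the container, not a heading
table_parent <- xml_name(xml_parent(kids[[3]]))
table_parent  # "div"
```

Because the headings are siblings rather than ancestors, any lookup has to go sideways (by position or by a sibling axis) rather than upwards.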

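So looping through that structure should preserve the order; you just have to switch from AL to NL at the right point, or something like that: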

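data <- list()
for(i in 1:6){
  if(i < 4){
    # American League: <h4> is a[[1]]; <h5>/table pairs are a[[2]]..a[[7]]
    title <- paste(a[[1]] %>% html_text, a[[i*2]] %>% html_text)
    data[[title]] <- a[[i*2+1]] %>% html_table  
  }else{
    # National League: <h4> is a[[8]], so the pairs shift by one: a[[9]]..a[[14]]
    title <- paste(a[[8]] %>% html_text, a[[i*2+1]] %>% html_text)
    data[[title]] <- a[[i*2+2]] %>% html_table  
  }
}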
> data
$`American League East`
        Team  W  L   Pct  GB  Home  Away Streak  RS  RA Diff L10
1  Tampa Bay 36 30 0.545  -- 19-19 17-11    L-1 246 239    7 6-4
2 NY Yankees 34 30 0.531 1.0 16-11 18-19    L-2 290 278   12 5-5
3  Baltimore 33 31 0.516 2.0 22-13 11-18    W-2 288 253   35 8-2
4    Toronto 34 32 0.515 2.0 20-12 14-20    L-2 361 292   69 8-2
5     Boston 28 38 0.424 8.0 16-18 12-20    W-1 258 315  -57 3-7

$`American League Central`
           Team  W  L   Pct  GB  Home  Away Streak  RS  RA Diff L10
1   Kansas City 36 25 0.590  -- 19-11 17-14    W-2 266 213   53 6-4
2     Minnesota 34 30 0.531 3.5 20-12 14-18    L-2 267 267    0 2-8
3       Detroit 34 31 0.523 4.0 18-18 16-13    L-1 272 265    7 6-4
4     Cleveland 30 33 0.476 7.0 12-18 18-15    W-1 265 269   -4 4-6
5 Chi White Sox 28 35 0.444 9.0 16-12 12-23    L-5 221 288  -67 3-7

$`American League West`
       Team  W  L   Pct   GB  Home  Away Streak  RS  RA Diff L10
1   Houston 38 28 0.576   -- 23-14 15-14    W-3 287 254   33 4-6
2     Texas 35 30 0.538  2.5 15-16 20-14    W-2 296 281   15 6-4
3 LA Angels 33 32 0.508  4.5 19-15 14-17    W-1 255 258   -3 5-5
4   Seattle 29 36 0.446  8.5 13-19 16-17    L-1 223 268  -45 5-5
5   Oakland 28 39 0.418 10.5 11-18 17-21    W-3 288 264   24 5-5

$`National League East`
          Team  W  L   Pct   GB  Home  Away Streak  RS  RA Diff L10
1      NY Mets 36 30 0.545   -- 26-11 10-19    W-3 254 252    2 6-4
2   Washington 34 31 0.523  1.5 16-12 18-19    W-1 289 280    9 4-6
3      Atlanta 31 34 0.477  4.5 15-14 16-20    L-1 278 295  -17 4-6
4        Miami 29 37 0.439  7.0 17-17 12-20    W-2 259 270  -11 6-4
5 Philadelphia 22 44 0.333 14.0 15-16  7-28    L-8 200 311 -111 1-9

$`National League Central`
        Team  W  L   Pct   GB  Home  Away Streak  RS  RA Diff L10
1  St. Louis 43 21 0.672   --  26-7 17-14    W-5 253 183   70 7-3
2 Pittsburgh 37 27 0.578  6.0 21-11 16-16    W-6 264 205   59 7-3
3   Chi Cubs 34 28 0.548  8.0 18-13 16-15    L-1 250 250    0 6-4
4 Cincinnati 29 35 0.453 14.0 17-13 12-22    W-1 257 277  -20 6-4
5  Milwaukee 24 42 0.364 20.0 11-24 13-18    L-4 241 313  -72 4-6

$`National League West`
           Team  W  L   Pct  GB  Home  Away Streak  RS  RA Diff L10
1    LA Dodgers 37 28 0.569  -- 25-10 12-18    L-2 281 225   56 6-4
2 San Francisco 35 31 0.530 2.5 17-18 18-13    W-1 265 259    6 4-6
3       Arizona 31 33 0.484 5.5 15-16 16-17    L-1 291 285    6 5-5
4     San Diego 32 35 0.478 6.0 16-19 16-16    L-3 284 299  -15 3-7
5      Colorado 28 36 0.438 8.5 13-18 15-18    L-2 277 318  -41 3-7
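
The list above is keyed by "league division" names rather than having League and Division columns. One possible way to get the columns the question asks for, without hard-coding positions, is to let each table look up its nearest preceding `<h4>` and `<h5>` sibling with the XPath `preceding-sibling` axis. This is only a sketch on a made-up HTML fragment (the `content` class and the rows are assumptions), not something tested against the real page:

```r
library(rvest)
library(xml2)

# Hypothetical fragment with the same heading/table sibling layout.
html <- '
<div class="content">
  <h4>American League</h4>
  <h5>East</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>Tampa Bay</td><td>36</td></tr></table>
  <h5>West</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>Houston</td><td>38</td></tr></table>
  <h4>National League</h4>
  <h5>East</h5>
  <table><tr><th>Team</th><th>W</th></tr><tr><td>NY Mets</td><td>36</td></tr></table>
</div>'

tables <- xml_find_all(read_html(html), "//div[@class='content']/table")

standings <- do.call(rbind, lapply(seq_along(tables), function(i) {
  tbl <- tables[[i]]
  df <- as.data.frame(html_table(tbl))
  # preceding-sibling is a reverse axis, so [1] is the closest heading
  # that appears before this table.
  df$League   <- xml_text(xml_find_first(tbl, "preceding-sibling::h4[1]"))
  df$Division <- xml_text(xml_find_first(tbl, "preceding-sibling::h5[1]"))
  df
}))

standings
```

On the real page, the same idea would presumably start from the tables found under `.yui3-tabview-content`; the advantage over index arithmetic is that it keeps working if a league gains or loses a division.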