Reading an HTML table in R

Asked: 2015-03-06 10:15:47

Tags: html r parsing

I'm trying to read head-to-head (h2h) data from the Tennis Abstract web page in R, using the XML package.

I want the large h2h table at the bottom of the page, which has the CSS selector
html > body > div#main > table#maintable > tbody > tr > td#stats > table#matches.tablesorter

I tried following the suggestions in "scraping html into r data frame". I think the difficulty comes from the table being nested inside another table.

url <- "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"
library(RCurl)
library(XML)

# Download the raw page source and split it into lines
webpage <- getURL(url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)  # the source does not contain the h2h table
pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE)
results <- xpathSApply(pagetree, "//*/table[@class='tablesorter']/tr/td", xmlValue)  # gives NULL

tables <- readHTMLTable(url, stringsAsFactors = TRUE)  # has 4 tables, not the desired one

I'm new to HTML parsing, so please bear with me.

1 Answer:

Answer 0: (score: 3)

This isn't the most efficient approach, but it gets the job done.

library(rvest)
library(RSelenium)

tennis.url <- "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"

# Start a Selenium server and open a browser session, so that the source
# we parse is the page as rendered in a browser.
# Note: checkForServer()/startServer() belong to older RSelenium releases;
# newer releases provide rsDriver() instead.
checkForServer(); startServer()
remDrv <- remoteDriver()
remDrv$open()

# Load the page and parse the rendered source
remDrv$navigate(tennis.url)
tennis.html <- read_html(remDrv$getPageSource()[[1]])  # html() in older rvest versions

remDrv$close()

# Extract one column per CSS selector; .[-1] drops the first matched cell
H2Hs <- tennis.html %>% html_nodes(".h2hclick") %>% html_text %>% as.numeric
Opponent <- tennis.html %>% html_nodes("#matches a") %>% html_text
Country <- tennis.html %>% html_nodes("a+ span") %>% html_text %>% gsub("[^(A-Z)]", "", .)
W <- tennis.html %>% html_nodes("#matches td:nth-child(3)") %>% .[-1] %>% html_text %>% as.numeric
L <- tennis.html %>% html_nodes("#matches td:nth-child(4)") %>% .[-1] %>% html_text %>% as.numeric
Win.Prc <- tennis.html %>% html_nodes("#matches td:nth-child(5)") %>% .[-1] %>% html_text

And so on. You just need to increment the # in nth-child(#) for each remaining column, then combine the vectors into a data frame.
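
The final step of combining the vectors could look roughly like the sketch below. It uses base R only. The values are made-up stand-ins for the scraped vectors, not real results, and the name h2h for the resulting data frame is an assumption. Because each selector may match a different number of nodes, the sketch checks that all vectors have the same length before building the data frame.

```r
# Hypothetical stand-ins for the vectors scraped above; the values are made up.
Opponent <- c("Player A", "Player B", "Player C")
H2Hs     <- c(10, 4, 7)
W        <- c(6, 1, 7)
L        <- c(4, 3, 0)
Win.Prc  <- c("60.0%", "25.0%", "100.0%")

cols <- list(Opponent = Opponent, H2Hs = H2Hs, W = W, L = L, Win.Prc = Win.Prc)

# Each selector can match a different number of nodes, so check lengths first.
stopifnot(length(unique(lengths(cols))) == 1)

# Build the data frame (h2h is an assumed name); keep text columns as character.
h2h <- data.frame(cols, stringsAsFactors = FALSE)

# Strip the percent sign so the win percentage can be used numerically.
h2h$Win.Prc <- as.numeric(sub("%", "", h2h$Win.Prc, fixed = TRUE))

str(h2h)
```

If the length check fails, one of the selectors is probably matching extra nodes (for example additional links inside #matches) and needs to be narrowed before the columns can be lined up.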