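I'm trying to read the head-to-head data from the Tennis Abstract web page in R using the XML package.
I want the large h2h table at the bottom,
CSS selector: html > body > div#main > table#maintable > tbody > tr > td#stats > table#matches.tablesorter
I have tried the suggestions from "scraping html into r data frame". I think the difficulty is caused by the table being nested inside another table.
url = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"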
library(RCurl)
library(XML)
webpage <- getURL(url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc) # doesn't have the h2h table
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
results <- xpathSApply(pagetree, "//*/table[@class='tablesorter']/tr/td", xmlValue) # gives NULL
tables <- readHTMLTable(url, stringsAsFactors = TRUE) # has 4 tables, not the desired one
I'm new to HTML parsing, so please bear with me.
Answer 0 (score: 3)
This isn't the most efficient approach, but it gets the job done.
library(rvest)
library(RSelenium)
tennis.url <- "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"
checkForServer(); startServer() # defunct in newer RSelenium releases, which use rsDriver() instead
remDrv <- remoteDriver()
remDrv$open()
remDrv$navigate(tennis.url)
tennis.html <- read_html(remDrv$getPageSource()[[1]]) # read_html() replaces the deprecated html()
remDrv$close()
H2Hs <- tennis.html %>% html_nodes(".h2hclick") %>% html_text %>% as.numeric
Opponent <- tennis.html %>% html_nodes("#matches a") %>% html_text
Country <- tennis.html %>% html_nodes("a+ span") %>% html_text %>% gsub("[^(A-Z)]", "", .)
W <- tennis.html %>% html_nodes("#matches td:nth-child(3)") %>% .[-1] %>% html_text %>% as.numeric
L <- tennis.html %>% html_nodes("#matches td:nth-child(4)") %>% .[-1] %>% html_text %>% as.numeric
Win.Prc <- tennis.html %>% html_nodes("#matches td:nth-child(5)") %>% .[-1] %>% html_text
And so on. You just need to increment the # in nth-child(#) for each remaining column, then combine the vectors into a data frame.
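
The "increment # and build a data frame" step could be sketched roughly as below. This is only a minimal illustration, not the exact code for the live page: sample.html is a made-up two-row stand-in for the rendered page source, get_col is a hypothetical helper, and the five column names are assumptions. In the stand-in the header cells are th elements, so the .[-1] step used above to drop the first cell is not needed here.

```r
library(rvest)

# Hypothetical stand-in for remDrv$getPageSource()[[1]]; the real table has more columns.
sample.html <- '<table id="matches">
  <thead><tr><th>H2Hs</th><th>Opponent</th><th>W</th><th>L</th><th>Win%</th></tr></thead>
  <tbody>
    <tr><td>10</td><td><a>Player A</a></td><td>7</td><td>3</td><td>70%</td></tr>
    <tr><td>4</td><td><a>Player B</a></td><td>1</td><td>3</td><td>25%</td></tr>
  </tbody>
</table>'
page <- read_html(sample.html)

# Hypothetical helper: text of every cell in column i of the #matches table.
get_col <- function(page, i) {
  selector <- paste0("#matches td:nth-child(", i, ")")
  page %>% html_nodes(selector) %>% html_text()
}

# Increment the nth-child index over the columns of interest.
cols <- lapply(1:5, get_col, page = page)

# Combine the column vectors, converting the numeric ones.
h2h <- data.frame(
  H2Hs     = as.numeric(cols[[1]]),
  Opponent = cols[[2]],
  W        = as.numeric(cols[[3]]),
  L        = as.numeric(cols[[4]]),
  Win.Prc  = cols[[5]],
  stringsAsFactors = FALSE
)
print(h2h)
```

Once the rendered source is available, it may also be worth trying html_table() on html_node(page, "#matches"), which parses the whole table in one call, though whether it handles the real page cleanly is untested here.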