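I want to extract the "Match by match list" table from
http://stats.espncricinfo.com/ci/engine/player/50710.html?class=2;template=results;type=batting;view=match
I am new to R, so I don't know much about extracting data from web pages. I used this code to extract the table: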
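library(XML)
fileUrl <- "http://stats.espncricinfo.com/ci/engine/player/50710.html?class=2;template=results;type=batting;view=match"
# load and parse the page into an internal document
sanga <- htmlTreeParse(fileUrl, useInternalNodes = TRUE)
# pull the text of every table row with class 'data1'
sanga.data <- xpathSApply(sanga, "//tr[@class='data1']", xmlValue)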
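But I end up with a single-column matrix in which each column of the original table appears as a row. I read the information in this post, but I still can't figure out how to get the data in tabular form: Scraping html tables into R data frames using the XML package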
Answer 0 (score: 0)
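You'll need to do some manipulation of the column names (and remove the NA
'spacer' column), but with the right XPath you can get straight to the table you want: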
library(rvest)
library(magrittr)
pg <- read_html("http://stats.espncricinfo.com/ci/engine/player/50710.html?class=2;template=results;type=batting;view=match")
pg %>%
  html_nodes(xpath="//tr[@class='data1']/../..") %>% # get to a reasonable set of tables (there are many)
  extract2(2) %>% # we want the second one
  html_table(header=TRUE, trim=TRUE) -> data # there's a header row; trim surrounding whitespace
str(data)
## 'data.frame': 397 obs. of 11 variables:
## $ Bat1 : chr "35" "85" "36*" "DNB" ...
## $ Runs : chr "35" "85" "36" "-" ...
## $ BF : chr "55" "116" "47" "-" ...
## $ SR : chr "63.63" "73.27" "76.59" "-" ...
## $ 4s : chr "4" "11" "3" "-" ...
## $ 6s : chr "0" "0" "0" "-" ...
## $ : logi NA NA NA NA NA NA ...
## $ Opposition: chr "v Pakistan" "v South Africa" "v Pakistan" "v South Africa" ...
## $ Ground : chr "Galle" "Galle" "Colombo (RPS)" "Colombo (SSC)" ...
## $ Start Date: chr "5 Jul 2000" "6 Jul 2000" "9 Jul 2000" "11 Jul 2000" ...
## $ : chr "ODI # 1603" "ODI # 1604" "ODI # 1608" "ODI # 1610" ...
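For the column-name cleanup mentioned above, here is a minimal base R sketch. It does not hit the website: it rebuilds a small stand-in for `data` from the first four rows of the `str()` output (only some of the columns), so it can be run on its own. The names `spacer` and `Match` are my own hypothetical labels for the two blank headers, and the choice of which columns to make numeric is an assumption about what you want to analyse.

```r
# Stand-in for the scraped table: first four rows, a subset of the columns.
# (On the real data you would skip this step and start from `data`.)
data <- data.frame(
  Bat1 = c("35", "85", "36*", "DNB"),
  Runs = c("35", "85", "36", "-"),
  SR   = c("63.63", "73.27", "76.59", "-"),
  V4   = c(NA, NA, NA, NA),
  `Start Date` = c("5 Jul 2000", "6 Jul 2000", "9 Jul 2000", "11 Jul 2000"),
  V6   = c("ODI # 1603", "ODI # 1604", "ODI # 1608", "ODI # 1610"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
names(data)[c(4, 6)] <- ""  # mimic the blank headers seen in str(data)

# 1. Find the all-NA 'spacer' column(s).
is_spacer <- vapply(data, function(col) all(is.na(col)), logical(1),
                    USE.NAMES = FALSE)

# 2. Give every blank header a name before subsetting.
#    'spacer' and 'Match' are hypothetical labels, not names from the site.
names(data)[is_spacer] <- "spacer"
names(data)[names(data) == ""] <- "Match"

# 3. Drop the spacer column(s).
clean <- data[, !is_spacer, drop = FALSE]

# 4. Turn the "-" placeholders into NA and convert to numeric
#    (assumed set of numeric columns; extend it to BF, 4s, 6s on the real data).
num_cols <- c("Runs", "SR")
clean[num_cols] <- lapply(clean[num_cols], function(col) {
  col[col == "-"] <- NA
  as.numeric(col)
})

str(clean)
```

Bat1 is left as character on purpose, since it mixes scores such as "36*" with markers such as "DNB".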