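I am trying to scrape data from this website, Extra Skater, into a data frame.
From the HTML I can see that the table rows carry several different classes, which the page uses to toggle which rows are displayed. I am only interested in rows with this tag: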
<tr class="team-game-stats team-game-stats-5v5close hidden">
For example:
<tr class="team-game-stats team-game-stats-5v5close hidden">
<td class="hidden">5v5close</td>
<td><a href="/game/2013-01-19-maple-leafs-canadiens">2013-01-19: Maple Leafs 2 at Canadiens 1</a></td>
<td class="number-right">19.7</td>
<td class="number-right">0</td>
<td class="number-right">0</td>
<td class="number-right">14</td>
<td class="number-right">18</td>
<td class="number-right">43.8%</td>
<td class="number-right">11</td>
<td class="number-right">15</td>
<td class="number-right">42.3%</td>
<td class="number-right">8</td>
<td class="number-right">11</td>
<td class="number-right">42.1%</td>
<td class="number-right">0.0%</td>
<td class="number-right">100.0%</td>
</tr>
When I run this code:
library(RCurl)
library(XML)
theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
tb = readHTMLTable(theurl)
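it returns a list in which all of the table rows are stacked on top of one another. I think I need to use xpathSApply for more precision, but I am not sure about the path argument. When I run this code: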
library(RCurl)
library(XML)
theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE)
# Extract table header and contents
results <- xpathSApply(pagetree, "//*/table[@class='team-game-stats team-game-stats-5v5close hidden']/tr/td", xmlValue)
the result is NULL.
Thank you for your time.
Answer 0 (score: 2)
Try this:
xxpath = "//*[@class='team-game-stats team-game-stats-5v5close hidden']"
xpathApply(pagetree,xxpath,readHTMLList)
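The query in the question returns NULL because the class attribute sits on the <tr> elements, not on <table>, so the path has to select rows directly. Below is a minimal sketch of that idea, assuming the XML package is installed. It runs against a small inline HTML sample (the game labels and the 60.0 and 21.3 values are made up for illustration) rather than the live page, and uses contains() so the match does not depend on the exact attribute string. Note that contains() is a substring test, so it would also match any longer class name that starts with the same text.

```r
library(XML)

# Inline sample mirroring the row structure quoted in the question.
# "Game A"/"Game B" and the numbers 60.0 and 21.3 are illustrative only.
html <- '<table>
<tr class="team-game-stats team-game-stats-all"><td class="hidden">all</td><td>Game A</td><td class="number-right">60.0</td></tr>
<tr class="team-game-stats team-game-stats-5v5close hidden"><td class="hidden">5v5close</td><td>Game A</td><td class="number-right">19.7</td></tr>
<tr class="team-game-stats team-game-stats-5v5close hidden"><td class="hidden">5v5close</td><td>Game B</td><td class="number-right">21.3</td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Select the <tr> nodes themselves; the class is on the row, not the table.
rows <- getNodeSet(doc, "//tr[contains(@class, 'team-game-stats-5v5close')]")

# One character vector of cell text per row, then bind the rows together.
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))
close_df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
```

With the real page, pagetree from the question would take the place of doc. The resulting columns are all character, so numeric columns would still need converting.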
Answer 1 (score: 0)
Could you filter the data.frame instead of the HTML?
tb <- readHTMLTable(theurl, which=1)
table(tb$Situation)
     5v5 5v5close  5v5tied      all       ev       pp       sh
      48       48       48       48       48       48       48
subset(tb, Situation=="5v5close")
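A small follow-up sketch of the filtering approach, using a hand-built stand-in for tb so it runs without network access. Only the Situation column name comes from the output above. The Game and Pct columns and their values are hypothetical placeholders. It also shows one way to turn percentage strings into numbers, on the assumption that readHTMLTable returns such columns as text (or factors) rather than numerics.

```r
# Stand-in for the data frame returned by readHTMLTable(theurl, which = 1).
# Game and Pct are hypothetical column names with made-up values.
tb <- data.frame(
  Situation = c("all", "5v5close", "5v5", "5v5close"),
  Game = c("Game A", "Game A", "Game B", "Game B"),
  Pct = c("50.0%", "43.8%", "51.2%", "42.3%"),
  stringsAsFactors = FALSE
)

# Keep only the rows for the situation of interest.
close_rows <- subset(tb, Situation == "5v5close")

# Strip the "%" sign before converting; sub() coerces factors to character.
close_rows$Pct <- as.numeric(sub("%", "", close_rows$Pct, fixed = TRUE))
```

Filtering after parsing keeps the header row that readHTMLTable extracts, which the row-level XPath approach does not provide.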