The XML package in R - readHTMLTable and multiple row classes

Asked: 2014-02-03 08:41:50

Tags: xml r xpath

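I am trying to scrape data from this website, Extra Skater, into a data frame.

From what I can see in the HTML, the table rows carry several classes, and the page uses them to toggle which rows are displayed. I am only interested in the rows with this tag: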

<tr class="team-game-stats team-game-stats-5v5close hidden">

For example:

<tr class="team-game-stats team-game-stats-5v5close hidden">
    <td class="hidden">5v5close</td>

    <td><a href="/game/2013-01-19-maple-leafs-canadiens">2013-01-19: Maple Leafs 2 at Canadiens 1</a></td>

    <td class="number-right">19.7</td>
    <td class="number-right">0</td>
    <td class="number-right">0</td>
    <td class="number-right">14</td>    
    <td class="number-right">18</td>
    <td class="number-right">43.8%</td>
    <td class="number-right">11</td>
    <td class="number-right">15</td>
    <td class="number-right">42.3%</td>
    <td class="number-right">8</td>
    <td class="number-right">11</td>
    <td class="number-right">42.1%</td>
    <td class="number-right">0.0%</td>
    <td class="number-right">100.0%</td>

</tr>

When I run this code:

library(RCurl)
library(XML)
theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
tb = readHTMLTable(theurl)

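it returns a list in which all the table rows, whatever their class, are stacked one on top of another. I think I need xpathSApply to be more precise, but I am not sure what the path argument should be. When I run this code: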

library(RCurl)
library(XML)

theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE)

# Extract table header and contents
results <- xpathSApply(pagetree, "//*/table[@class='team-game-stats team-game-stats-5v5close hidden']/tr/td", xmlValue)

the result is NULL.

Thanks for your time.

2 answers:

Answer 0 (score: 2)

Try this:

xxpath = "//*[@class='team-game-stats team-game-stats-5v5close hidden']"
xpathApply(pagetree,xxpath,readHTMLList)
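
A note on why the query in the question comes back empty: the class attribute sits on the <tr> elements, while that XPath puts the predicate on <table>, so nothing matches. The following is a minimal sketch of pulling the cell values from the matching rows. The inline HTML is a made-up stand-in for the real page, and the names `html`, `rows`, `cells`, and `close_df` are illustrative, not part of the original code.

```r
library(XML)

# Hypothetical inline HTML standing in for the real page (an assumption).
html <- '
<table>
  <tr class="team-game-stats team-game-stats-all">
    <td class="hidden">all</td><td>Game A</td><td class="number-right">60.0</td>
  </tr>
  <tr class="team-game-stats team-game-stats-5v5close hidden">
    <td class="hidden">5v5close</td><td>Game A</td><td class="number-right">19.7</td>
  </tr>
  <tr class="team-game-stats team-game-stats-5v5close hidden">
    <td class="hidden">5v5close</td><td>Game B</td><td class="number-right">25.1</td>
  </tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# The class attribute belongs to <tr>, so the predicate goes on tr, not table.
rows <- getNodeSet(doc, "//tr[contains(@class, 'team-game-stats-5v5close')]")

# One character vector of cell values per matching row.
cells <- lapply(rows, function(tr) xpathSApply(tr, "./td", xmlValue))

# Bind the rows into a data frame; columns get default names V1, V2, V3.
close_df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
print(close_df)
```

Using contains() avoids depending on the exact order and spacing of the class names, at the cost of also matching any other class that happens to contain the same substring. An exact match on the full attribute string, as in the answer above, is stricter.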

Answer 1 (score: 0)

Could you filter the data.frame instead of the HTML?

tb <- readHTMLTable(theurl, which=1)
table(tb$Situation)
     5v5 5v5close  5v5tied      all       ev       pp       sh 
      48       48       48       48       48       48       48 
subset(tb, Situation=="5v5close")
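
After filtering, the percentage columns are still text such as "43.8%". A small sketch of the filter followed by a numeric conversion, assuming a tiny made-up table in place of the real page; the column names `Game` and `Pct` are hypothetical, and only `Situation` comes from the output above.

```r
library(XML)

# Hypothetical stand-in table; only the Situation column name is taken
# from the thread, the other names and values are assumptions.
html <- '
<table>
<tr><th>Situation</th><th>Game</th><th>Pct</th></tr>
<tr><td>all</td><td>Game A</td><td>51.2%</td></tr>
<tr><td>5v5close</td><td>Game A</td><td>43.8%</td></tr>
<tr><td>5v5close</td><td>Game B</td><td>42.3%</td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)
tb <- readHTMLTable(doc, which = 1, header = TRUE)

# Keep only the rows for the situation of interest.
close_rows <- subset(tb, as.character(Situation) == "5v5close")

# Strip the percent sign and convert to numeric; as.character() guards
# against the column having been read as a factor.
close_rows$Pct <- as.numeric(sub("%", "", as.character(close_rows$Pct), fixed = TRUE))
print(close_rows)
```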