xpathSApply webscrape returns NULL

Date: 2015-10-28 23:22:30

Tags: r xpath

Using my trusty Firebug and FirePath plugins, I am trying to scrape some data.

require(XML)

url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works

This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"

If I try to capture the first sectional time of 29.4:

t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work

t contains NULL.

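Any idea what I'm doing wrong? Many thanks.

2 Answers:

Answer 0 (score: 1)

First, I can't find a first sectional time of 29.4. The one I see on the page you linked is 24.5, unless I have misunderstood what you are looking for.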

Here is a way to grab that value using rvest and SelectorGadget for Chrome:

library(rvest)

html <- read_html(url)
t <- html %>%
    html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
    html_text(trim = T)

> t
[1] "24.5"

This differs slightly from your approach, but I hope it helps. I'm not sure how to scrape the meeting date properly, but this at least works:

mt <- html %>%
    html_nodes("font > table font") %>%
    html_text(trim = T)

> mt
 [1] "Meeting Date: 25/05/2008, Sha Tin"                      "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
 [3] "MONEY TALKS HANDICAP"                                   "Race\tTime :"
 [5] "(24.5)"                                                 "(48.1)"
 [7] "(1.10.3)"                                               "Sectional Time :"
 [9] "24.5"                                                   "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"

Answer 1 (score: 0)

It looks like the comments after the <a> tag may be tripping you up.

<a name="Race1">
  <!-- test0 table start -->
  <table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
  <!--0 table End -->
  <!-- test1 table start -->
  <br>
  <br>
</a>

This seems to work:

t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)

You may want to try something less fragile than such a long, direct path.
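To illustrate why anchoring on a label is less fragile than a full path, here is a minimal sketch run against a made-up HTML fragment rather than the real page. The fragment, the variable names `snippet`, `doc`, and `first_time`, and the table layout are assumptions for illustration only; the XPath expression is the one from this answer.

```r
library(XML)

# Hypothetical stand-in for the relevant part of the page (not the real markup).
snippet <- paste0(
  "<html><body><table><tr>",
  "<td><font>Sectional Time : </font></td>",
  "<td><font>24.5</font></td>",
  "<td><font>23.6</font></td>",
  "</tr></table></body></html>"
)

# Parse the string directly instead of fetching a URL.
doc <- htmlParse(snippet, asText = TRUE)

# Find the label cell, step up to its <td>, then take the next <td>'s <font>.
first_time <- xpathSApply(
  doc,
  '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font',
  xmlValue
)
print(first_time)
```

Because the expression only depends on the label text and the cell next to it, it should keep working if wrapper tables or comments are added higher up in the document.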

Update

If you are after the "1st Sec." column (29.4, 28.7, etc.):

t <- xpathSApply(
  tree,
  "//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
  xmlValue
)

This looks for the "1st Sec." column header, steps up to its row, and then grabs the first td value from every second row that follows.

[1] "29.4   "
[2] "28.7   "
[3] "29.2   "
[4] "29.0   "
[5] "29.3   "
[6] "28.2   "
[7] "29.5   "
[8] "29.5   "
[9] "30.1   "
[10] "29.8   "
[11] "29.6   "
[12] "29.9   "
[13] "29.1   "
[14] "29.8   "

I have removed all the extra whitespace (\r\n\t\t...) for display.
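If you want to do that cleanup in code rather than by hand, a minimal sketch using base R could look like the following. The vector `raw` is a made-up sample standing in for the xpathSApply result, and `times` is a name chosen for illustration.

```r
# Made-up sample of scraped values with trailing spaces and control characters.
raw <- c("29.4   \r\n\t\t\t", "28.7   ", "29.2   \r\n")

# trimws() strips leading and trailing spaces, tabs, and line breaks;
# as.numeric() then converts the cleaned strings to numbers.
times <- as.numeric(trimws(raw))
print(times)
```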

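If you want to make it more dynamic, you can grab the column values under "1st Sec." or any other column by looking up the column's position. Replace

/td[1]

with

/td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]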

With this, you can change the column name and get the corresponding values. For all the "3rd Sec." times:

"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"

[1] "23.3   "
[2] "23.7   "
[3] "23.3   "
[4] "23.8   "
[5] "23.7   "
[6] "24.5   "
[7] "24.1   "
[8] "24.0   "
[9] "24.1   "
[10] "24.1   "
[11] "23.9   "
[12] "23.9   "
[13] "24.3   "
[14] "24.0   "
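Since only the column name changes between these expressions, the string can be built by a small helper. This is a sketch only: `sectional_xpath` is a hypothetical function name, and it assumes the column label contains no single quotes.

```r
# Build the XPath for a given column header (hypothetical helper).
sectional_xpath <- function(column) {
  # Expression that locates the header cell for the requested column.
  header <- sprintf("//tr/td[starts-with(.,'%s')]", column)
  # Every second following row, at the same column position as the header.
  sprintf(
    "%s/../following-sibling::*[position() mod 2 = 0]/td[count(%s/preceding-sibling::*)+1]",
    header, header
  )
}

third_sec <- sectional_xpath("3rd Sec.")
print(third_sec)

# Usage with the parsed tree from the question (not run here):
# t <- xpathSApply(tree, sectional_xpath("3rd Sec."), xmlValue)
```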