Question

Using my trusty Firebug and FirePath plugins, I'm trying to scrape some data.

require(XML)

url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works

This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t".

But if I try to capture the first sectional time, 29.4:

t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work

t contains NULL.

Any idea what I'm doing wrong? Many thanks.

Answer 1

First, I can't find a first sectional time of 29.4. The one I see on the page you linked is 24.5, unless I've misunderstood what you're looking for.

Here is how to grab it using rvest and SelectorGadget for Chrome:

library(rvest)

html <- read_html(url)
t <- html %>%
    html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
    html_text(trim = TRUE)

> t
[1] "24.5"

This differs a little from your approach, but I hope it helps. I'm not sure how to scrape the meeting date properly, but this at least works:

mt <- html %>%
    html_nodes("font > table font") %>%
    html_text(trim = TRUE)

> mt
 [1] "Meeting Date: 25/05/2008, Sha Tin"                      "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
 [3] "MONEY TALKS HANDICAP"                                   "Race\tTime :"                                           
 [5] "(24.5)"                                                 "(48.1)"                                                
 [7] "(1.10.3)"                                               "Sectional Time :"                                      
 [9] "24.5"                                                   "23.6"                                                  
[11] "22.2"                                                  
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"
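Picking values out of mt by position (mt[9], mt[10], ...) only works while the page layout stays the same. A minimal sketch of a slightly sturdier approach, assuming the sectional times always follow the "Sectional Time :" label as they do in the output above, and assuming "(1.10.3)" means 1 minute 10.3 seconds. The vector is copied from the tail of that output; the variable names are illustrative only.

```r
# Values copied from the tail of the `mt` output above.
mt <- c("(24.5)", "(48.1)", "(1.10.3)", "Sectional Time :", "24.5", "23.6", "22.2")

# Locate the label instead of hard-coding positions.
label_pos <- which(mt == "Sectional Time :")

# Everything after the label is treated as a sectional time (assumption).
sectionals <- as.numeric(mt[(label_pos + 1):length(mt)])

# Sanity check: the sections should add up to the final race time
# (assumed here to be 70.3 seconds).
total <- sum(sectionals)
```

Comparing total with the final race time is a cheap way to notice when the selector has started picking up the wrong cells.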

Answer 2

It looks like the comments after <a> may be throwing you off.

<a name="Race1">
  <!-- test0 table start -->
  <table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
  <!--0 table End -->
  <!-- test1 table start -->
  <br>
  <br>
</a>

This seems to work:

t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)

You may want to try something less brittle than such a long direct path.
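As a rough illustration of anchoring on a label rather than on an absolute path, here is a minimal sketch. The HTML string is a hypothetical, heavily simplified stand-in for the relevant row of the page, not the real markup, and normalize-space() is used here as a variation on the exact text() match so that trailing whitespace in the label does not matter.

```r
library(XML)

# Hypothetical, simplified stand-in for the relevant part of the page.
html <- '<table><tr>
  <td><font>Sectional Time : </font></td>
  <td><font>24.5</font></td>
  <td><font>23.6</font></td>
</tr></table>'
doc <- htmlParse(html, asText = TRUE)

# Anchor on the label cell, then step to the next cell in the same row.
first_sectional <- xpathSApply(
  doc,
  "//td[normalize-space(.) = 'Sectional Time :']/following-sibling::td[1]",
  xmlValue
)

# Strip any surrounding whitespace from the extracted value.
first_sectional <- trimws(first_sectional)
```

Because the expression depends only on the label text and the cell next to it, it should keep working if wrapper tables are added or removed higher up in the page.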

Update

If you are after the "1st Sec." column (29.4, 28.7, etc.):

t <- xpathSApply(
  tree,
  "//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
  xmlValue
)

This finds the "1st Sec." header cell, steps up to its row, and takes the first td value from every second row after it.

[1] "29.4   "
[2] "28.7   "
[3] "29.2   "
[4] "29.0   "
[5] "29.3   "
[6] "28.2   "
[7] "29.5   "
[8] "29.5   "
[9] "30.1   "
[10] "29.8   "
[11] "29.6   "
[12] "29.9   "
[13] "29.1   "
[14] "29.8   "

I have removed the extra whitespace (\r\n\t\t...) for display.

To make this more dynamic, you can select values by the position of the "1st Sec." column, or of any other column. Replace
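To see what the "every second row" predicate and the whitespace clean-up do, here is a minimal sketch on a hypothetical table. The layout (a header row followed by spacer rows alternating with data rows) is an assumption made for illustration, and the middle-column values are invented; only the first and third columns reuse values from the outputs in this answer.

```r
library(XML)

# Hypothetical table: header row, then spacer rows alternating with data rows.
html <- '<table>
  <tr><td>1st Sec.</td><td>2nd Sec.</td><td>3rd Sec.</td></tr>
  <tr><td></td></tr>
  <tr><td>29.4   </td><td>24.1</td><td>23.3   </td></tr>
  <tr><td></td></tr>
  <tr><td>28.7   </td><td>24.0</td><td>23.7   </td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

# From the header cell, go up to its row, keep the 2nd, 4th, ... following
# rows, and take the first cell of each.
raw <- xpathSApply(
  doc,
  "//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
  xmlValue
)

# Remove the padding and convert to numbers.
first_sec <- as.numeric(trimws(raw))
```

If the real page does not alternate rows this way, the position() predicate is the part to adjust.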

/td[1]

with

/td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]

With this, you can change the column name and get the corresponding values. For all "3rd Sec." times:

"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"

[1] "23.3   "
[2] "23.7   "
[3] "23.3   "
[4] "23.8   "
[5] "23.7   "
[6] "24.5   "
[7] "24.1   "
[8] "24.0   "
[9] "24.1   "
[10] "24.1   "
[11] "23.9   "
[12] "23.9   "
[13] "24.3   "
[14] "24.0   "
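The column-index expression can be wrapped in a small function so that only the header name changes between calls. This is a minimal sketch: column_values is a hypothetical helper, not part of the XML package, and the table is an invented stand-in that assumes spacer rows alternate with data rows. The header text must not contain a single quote, since it is pasted into the XPath string as-is.

```r
library(XML)

# Hypothetical helper: values of the column whose header starts with `header`.
column_values <- function(doc, header) {
  header_cell <- sprintf("//tr/td[starts-with(., '%s')]", header)
  # The number of cells before the header cell, plus one, is the column index.
  xpath <- sprintf(
    "%s/../following-sibling::*[position() mod 2 = 0]/td[count(%s/preceding-sibling::*) + 1]",
    header_cell, header_cell
  )
  trimws(xpathSApply(doc, xpath, xmlValue))
}

# Invented stand-in table (the middle-column values are made up).
html <- '<table>
  <tr><td>1st Sec.</td><td>2nd Sec.</td><td>3rd Sec.</td></tr>
  <tr><td></td></tr>
  <tr><td>29.4   </td><td>24.1</td><td>23.3   </td></tr>
  <tr><td></td></tr>
  <tr><td>28.7   </td><td>24.0</td><td>23.7   </td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

third_sec <- column_values(doc, "3rd Sec.")
first_sec <- column_values(doc, "1st Sec.")
```

Note that count() adds up matches across the whole document, so this approach assumes the header text appears in only one cell.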

xpathSApply webscrape returns NULL