Using my trusty Firebug and FirePath plugins, I'm trying to scrape some data.
require(XML)
url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works
This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"
If I try to capture the first sectional time of 29.4:
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work
t contains NULL.
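Any idea what I'm doing wrong? Many thanks.
Answer 0 (score: 1)
First, I can't find a first sectional time of 29.4. The one I see on the page you linked is 24.5, unless I'm misunderstanding what you're looking for.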
Here is a way of grabbing that value using rvest and SelectorGadget for Chrome:
library(rvest)
html <- read_html(url)
t <- html %>%
html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
html_text(trim = T)
> t
[1] "24.5"
This differs slightly from your approach, but I hope it helps. I'm not sure how to scrape the meeting date properly, but this at least works:
mt <- html %>%
html_nodes("font > table font") %>%
html_text(trim = T)
> mt
[1] "Meeting Date: 25/05/2008, Sha Tin" "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
[3] "MONEY TALKS HANDICAP" "Race\tTime :"
[5] "(24.5)" "(48.1)"
[7] "(1.10.3)" "Sectional Time :"
[9] "24.5" "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"
Answer 1 (score: 0)
It looks like the comments after the <a> tag might be tripping you up.
<a name="Race1">
<!-- test0 table start -->
<table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
<!--0 table End -->
<!-- test1 table start -->
<br>
<br>
</a>
This seems to work:
t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)
You might want to try something less fragile than that long direct path.
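As a rough illustration of why anchoring on label text is less fragile than a long positional path, here is a minimal sketch that runs the same expression against a small inline document. The HTML string is a hypothetical, simplified stand-in for the real page, and the `normalize-space()` variant is my own suggestion rather than something tested against the live site.

```r
library(XML)

# Hypothetical, simplified stand-in for the race page (NOT the real markup).
html <- '<html><body><table>
  <tr><td><font>Race Time :</font></td><td><font>(24.5)</font></td></tr>
  <tr><td><font>Sectional Time : </font></td><td><font>24.5</font></td><td><font>23.6</font></td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Same expression as above: find the label cell by its text, step up to the
# enclosing td, then take the first td that follows it in the same row.
first_sectional <- xpathSApply(
  doc,
  '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font',
  xmlValue
)

# Assumed variant: normalize-space() ignores leading/trailing whitespace,
# so the match no longer depends on the exact trailing space in the label.
first_sectional_ns <- xpathSApply(
  doc,
  '//tr/td/font[normalize-space(.)="Sectional Time :"]/../following-sibling::td[1]/font',
  xmlValue
)

print(first_sectional)
print(first_sectional_ns)
```

Neither query depends on how deeply the table is nested, so wrapper elements such as the <a> shown earlier do not affect the match.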
Update
If you're after the "1st Sec." column: 29.4, 28.7, etc...
t <- xpathSApply(
tree,
"//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
xmlValue
)
This looks for the "1st Sec." column header, steps up to its row, and then grabs the first td value from every second row that follows.
[1] "29.4 "
[2] "28.7 "
[3] "29.2 "
[4] "29.0 "
[5] "29.3 "
[6] "28.2 "
[7] "29.5 "
[8] "29.5 "
[9] "30.1 "
[10] "29.8 "
[11] "29.6 "
[12] "29.9 "
[13] "29.1 "
[14] "29.8 "
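I've removed all the extra whitespace (\r\n\t\t...) for display purposes.
If you want to make it more dynamic, you can grab the column values under "1st Sec." or any other column. Replace
/td[1]
with
/td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]
Using this, you can change the name of the column and get the corresponding values. For all "3rd Sec." times: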
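If you want the whitespace removed in code rather than by hand, something like the following should do it. This is a small sketch using base R only; the `raw` vector is made-up sample input shaped like the scraped strings, not actual output from the page.

```r
# Hypothetical sample of scraped cell values with trailing whitespace.
raw <- c("29.4\r\n\t\t\t", "28.7 \r\n\t\t\t", "29.2\r\n")

# trimws() strips leading and trailing spaces, tabs, CRs and newlines.
clean <- trimws(raw)

# Convert to numbers once the strings are clean.
times <- as.numeric(clean)

print(clean)
print(times)
```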
"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"
[1] "23.3 "
[2] "23.7 "
[3] "23.3 "
[4] "23.8 "
[5] "23.7 "
[6] "24.5 "
[7] "24.1 "
[8] "24.0 "
[9] "24.1 "
[10] "24.1 "
[11] "23.9 "
[12] "23.9 "
[13] "24.3 "
[14] "24.0 "
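To avoid editing the expression by hand for each column, the header name can be passed into a small helper that builds the XPath. The sketch below is illustrative only: `column_xpath` is a hypothetical helper name, and the inline table is an assumed layout (header row followed by alternating spacer and data rows) chosen to match the `position() mod 2 = 0` step, not a copy of the real page.

```r
library(XML)

# Hypothetical helper: builds the column XPath for a given header label.
# Note: the label must not contain a single quote, since it is pasted
# directly into the XPath string.
column_xpath <- function(header) {
  header_td <- sprintf("//tr/td[starts-with(.,'%s')]", header)
  sprintf(
    "%s/../following-sibling::*[position() mod 2 = 0]/td[count(%s/preceding-sibling::*)+1]",
    header_td, header_td
  )
}

# Assumed layout: header row, then spacer and data rows alternating.
html <- '<table>
  <tr><td>1st Sec.</td><td>2nd Sec.</td><td>3rd Sec.</td></tr>
  <tr><td colspan="3">spacer</td></tr>
  <tr><td>29.4 </td><td>24.1 </td><td>23.3 </td></tr>
  <tr><td colspan="3">spacer</td></tr>
  <tr><td>28.7 </td><td>24.3 </td><td>23.7 </td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

first <- trimws(xpathSApply(doc, column_xpath("1st Sec."), xmlValue))
third <- trimws(xpathSApply(doc, column_xpath("3rd Sec."), xmlValue))

print(first)
print(third)
```

Because the column index is computed from the header's position, the same helper keeps working if columns are reordered, as long as the header text stays the same.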