Question

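I'm working on a university project that involves web scraping. I'm trying to get all the links to the player profiles on this website (http://www.atpworldtour.com/en/rankings/singles?rankDate=2015-11-02&rankRange=1-5001). I tried to get the links with the following code: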

library(XML)
doc_parsed <- htmlTreeParse("ranking.html", useInternalNodes = TRUE)
root <- xmlRoot(doc_parsed)
hrefs1 <- xpathSApply(root, path = "//a", fun = xmlGetAttr, "href")

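"ranking.html" is the page at that link, saved locally. When I run the code, it gives me a list of 6887 items instead of the 5000 links to player profiles. What should I do?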

Answer 1

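To narrow the result down to the links you want, you have to include in the XPath expression an attribute that is specific to the elements you are after. The best and fastest way is to use an id (which should be unique). The next best is to use a path under an element with a specific class. For example: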

hrefs1 <- xpathSApply(root, path = '//td[@class="player-cell"]/a', fun = xmlGetAttr, "href")

By the way, the page you linked to currently has only 2252 links, not 5000.
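
To see the effect of the more specific expression without the saved page, here is a minimal sketch on a made-up HTML fragment. The markup, the "rankings" id, the "country-cell" class and the href values are all assumptions for illustration, not the real structure of the ATP page; only the "player-cell" class is taken from the expression above.

```r
library(XML)

# Hypothetical miniature page: the id, classes and hrefs are invented
# for illustration and only mimic the kind of markup discussed above.
html <- '
<html><body>
  <a href="/en/home">Home</a>
  <table id="rankings">
    <tr><td class="player-cell"><a href="/players/a">Player A</a></td></tr>
    <tr><td class="player-cell"><a href="/players/b">Player B</a></td></tr>
    <tr><td class="country-cell"><a href="/countries/xx">XX</a></td></tr>
  </table>
</body></html>'

# Parse the string directly instead of reading a file.
doc <- htmlParse(html, asText = TRUE)

# Every link on the page, including navigation and other cells.
all_links <- xpathSApply(doc, "//a", xmlGetAttr, "href")

# Only links that sit directly inside a cell with the given class.
player_links <- xpathSApply(doc, '//td[@class="player-cell"]/a',
                            xmlGetAttr, "href")

# Links anywhere under the element with a given (assumed) id.
table_links <- xpathSApply(doc, '//table[@id="rankings"]//a',
                           xmlGetAttr, "href")

length(all_links)     # 4
length(player_links)  # 2
length(table_links)   # 3
```

Comparing the lengths shows why '//a' overshoots: it also picks up navigation links and links in other cells. Restricting by id narrows the search to one part of the page, and restricting by class narrows it to the cells you actually care about.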
