Question

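I use R. R may not be the best tool for scraping the NFL website, but that is not my question. I can usually get everything I want, but this is the first time I have run into a problem. In this case, I want to get information from this page: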

http://www.nfl.com/player/j.j.watt/2495488/profile

The information I want to get is the onclick attribute of this link:

<a href="draft" onclick="s_objectID=&quot;http://www.nfl.com/player/j.j.watt/2495488/draft_1&quot;;return this.s_oc?this.s_oc(e):true">Draft</a>

Using xpathSApply(parsedPage, xmlGetAttr, name = "onclick") I only get NULL... and I have no idea why.

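I could retrieve the information from elsewhere in the code and then paste it together to rebuild the address, but I find it easier and cleaner to get it directly. How can I get this with R? I know very little about JavaScript and would be happy to avoid it.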

Thanks in advance for your help.

Answer 1

The reason is that there are no "onclick" attributes in the page source. See (in Chrome) view-source:http://www.nfl.com/player/j.j.watt/2495488/profile

The onclick attributes are added by JavaScript after the page loads. You therefore need a parser that executes JS.
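To see the effect without touching the network, here is a minimal sketch. The HTML string is a hypothetical stand-in for the static page source (it is not copied from nfl.com); it only mirrors the fact that the link carries no onclick attribute until JavaScript runs. rvest reports a missing attribute as NA, where xmlGetAttr returns NULL by default.

```r
library(rvest)

# Hypothetical stand-in for the static source: the link has an href
# but no onclick attribute, as in view-source before any JS has run.
static_html <- '<html><body><a href="draft">Draft</a></body></html>'

page <- read_html(static_html)
link <- html_nodes(page, "a")

href_value    <- html_attr(link, "href")     # attribute present in the source
onclick_value <- html_attr(link, "onclick")  # attribute absent, so NA

print(href_value)
print(onclick_value)
```

A static parser only ever sees this version of the document, so no choice of XPath or selector can recover the attribute.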

In R, you can do this with RSelenium as follows:

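library(RSelenium)
library(rvest)

# Start a Selenium server and open a browser session.
# Note: startServer() is defunct in later RSelenium releases, where rsDriver() is the replacement.
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()

# Load the page in a real browser so its JavaScript runs, then grab the rendered source
remDr$navigate("http://www.nfl.com/player/j.j.watt/2495488/profile")
doc <- remDr$getPageSource()

# Parse the rendered HTML and read the onclick attribute of the matching links
doc <- read_html(doc[[1]])
doc %>% html_nodes(".HOULink") %>% html_attr("onclick")

remDr$close()
# Shut down the Selenium server
browseURL("http://localhost:4444/selenium-server/driver/?cmd=shutDownSeleniumServer")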

For me, this resulted in:

[1] "s_objectID=\"http://www.nfl.com/teams/houstontexans/profile?team=HOU_1\";return this.s_oc?this.s_oc(e):true"                 
[2] "s_objectID=\"http://www.houstontexans.com/_2\";return this.s_oc?this.s_oc(e):true"                                           
[3] "s_objectID=\"http://www.nfl.com/gamecenter/2015122004/2015/REG15/texans@colts/watch_1\";return this.s_oc?this.s_oc(e):true"  
...
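If only the address inside each onclick string is needed, a small base R sketch along these lines could pull it out. The helper name extract_object_id is hypothetical, and the pattern assumes every string starts with s_objectID="..." as in the output above. The extracted value keeps its trailing suffix (such as _1) exactly as it appears in the attribute.

```r
# Two of the onclick strings from the output above
onclick <- c(
  's_objectID="http://www.nfl.com/teams/houstontexans/profile?team=HOU_1";return this.s_oc?this.s_oc(e):true',
  's_objectID="http://www.houstontexans.com/_2";return this.s_oc?this.s_oc(e):true'
)

# Hypothetical helper: capture whatever sits between the quotes after s_objectID=
extract_object_id <- function(x) {
  sub('^s_objectID="([^"]*)".*$', "\\1", x)
}

urls <- extract_object_id(onclick)
print(urls)
```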

You can also use a headless browser such as phantomjs; see https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html

Scraping in R, cannot get the "onclick" attribute
