Question

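I am trying to build an academic network using information from Google Scholar. Part of this involves scraping data from a pop-up window that is generated by clicking an article title on an individual scholar's page. (I am not actually sure what kind of window it is; it does not appear to be a regular window or an iframe.)

I have been using RSelenium for this task. Below is the code I have developed so far to interact with Google Scholar.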

#Libraries----    
library(RSelenium)


#Functions----
#Convenience function for simplifying data generated from .$findElements()
unPack <- function(x, opt = "text"){
  unlist(sapply(x, function(x){x$getElementAttribute(opt)}))  
}


#Analysis----
#Start up the server for Chrome.
rD <- rsDriver(browser = "chrome")
#Start Chrome.
remDr <- rD[["client"]]
#Add a test URL.
siteAdd <- "http://scholar.google.com/citations?user=sc3TX6oAAAAJ&hl=en&oi=ao"
#Open the site.
remDr$navigate(siteAdd)

#Create a list of all the article titles
cite100Elem <- remDr$findElements(using = "css selector", value = "a.gsc_a_at")
cite100 <- unPack(cite100Elem)

#Start scraping the first article. I will create some kind of loop for all
# articles later.
#This opens the pop-up window with additional data I'm interested in.
citeTitle <- cite100[1]
citeElem <- remDr$findElement(using = 'link text', value = citeTitle)
citeElem$clickElement()

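This is where I get stuck. Looking at the underlying page with Chrome's Developer Tools, I can see that the first piece of information I am interested in, the article's authors, is associated with the following HTML: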

<div class="gsc_vcd_value">TR Moore, NT Roulet, JM Waddington</div>

This suggests that I should be able to do something like this:

#Extract all the information about the article.
articleElem <- remDr$findElements(value = '//*[@class="gsc_vcd_title"]')
articleInfo <- unPack(articleElem)

However, this solution does not seem to work; it returns the value "NULL".

I am hoping someone out there has an R-based solution, as I know very little about JavaScript.
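For reference, here is a minimal sketch of a variant that might be worth trying. None of it is confirmed against the live page. It rests on three assumptions: the pop-up content may not be loaded yet when `findElements()` runs, `text` may not be an attribute that a `div` exposes (it does work for the `a` elements above), and the class in the HTML snippet is `gsc_vcd_value` whereas the XPath above queries `gsc_vcd_title`. The helpers `waitForElements()` and `unPackText()` are hypothetical names introduced here, not part of RSelenium; they only call `findElements()` and `getElementText()`.

```r
# Hypothetical helper: poll the driver until at least one matching element
# is found, or give up after `timeout` seconds and return whatever was found
# (an empty list if nothing matched).
waitForElements <- function(driver, using, value, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    elems <- driver$findElements(using = using, value = value)
    if (length(elems) > 0 || Sys.time() >= deadline) return(elems)
    Sys.sleep(interval)
  }
}

# Hypothetical variant of unPack() that reads each element's visible text
# with getElementText() instead of asking for a "text" attribute.
unPackText <- function(x) {
  unlist(lapply(x, function(el) el$getElementText()))
}

# Usage, assuming `remDr` is the live session from the code above and the
# article title has already been clicked:
# valueElems  <- waitForElements(remDr, "css selector", "div.gsc_vcd_value")
# articleInfo <- unPackText(valueElems)
```

If `articleInfo` is still `NULL` after the timeout, that would suggest the elements are not reachable from the current document at all, rather than merely late to load.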

Finally, if I search the text produced by the following code (which parses the page I am currently on):

htmlOut <- XML::htmlParse(remDr$getPageSource()[[1]])
htmlOut

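I cannot find the CSS class associated with "gsc_vcd_title", which tells me that the page I am interested in has a more complex structure that I have not quite figured out yet.

Any insights you have would be very welcome. Thanks!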
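To make the check above less dependent on reading the printed output by eye, the parsed document can be queried for a class directly. The sketch below runs on a static string so that it works without a browser. `pageSource` is a hypothetical stand-in for `remDr$getPageSource()[[1]]`, containing only the `div` quoted earlier, and `classCount()` is a helper name made up for this example.

```r
library(XML)

# Hypothetical stand-in for remDr$getPageSource()[[1]]; it holds only the
# div quoted earlier in the question, not the real Google Scholar markup.
pageSource <- '<html><body>
  <div class="gsc_vcd_value">TR Moore, NT Roulet, JM Waddington</div>
</body></html>'

htmlOut <- htmlParse(pageSource, asText = TRUE)

# Hypothetical helper: count the nodes whose class attribute contains `cls`
# as a whole, space-separated token.
classCount <- function(doc, cls) {
  xp <- sprintf(
    '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
    cls
  )
  length(getNodeSet(doc, xp))
}

classCount(htmlOut, "gsc_vcd_title")  # class used in my XPath query
classCount(htmlOut, "gsc_vcd_value")  # class shown in the HTML snippet

# Extract the text of every matching div.
authors <- xpathSApply(htmlOut, '//div[@class="gsc_vcd_value"]', xmlValue)
```

Running the same two `classCount()` calls on the real page source, before and after clicking the title, would show whether the pop-up markup ever appears in what `getPageSource()` returns.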

Web scraping Google Scholar with RSelenium

0 answers: