Extracting web links from a pop-up window

Time: 2015-04-24 07:34:11

Tags: r web-scraping

I need to get the web links of all the followers listed on the following page.

https://www.researchgate.net/topic/biotechnology

At the moment this topic has 206770 followers. When I click the "See all" button, a pop-up window appears showing a list that keeps expanding as I scroll down.

https://www.researchgate.net/profile/Kestutis_Sasnauskas ...

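The link above is for one of the top followers. Is there a way to get the web links of all 206770 followers?

3 answers:

Answer 0 (score: 1)

This can be done with rvest and RSelenium. The latter is essentially required; the former makes your life easier. Install RSelenium from GitHub with devtools::install_github("ropensci/RSelenium") and rvest from CRAN.

The code below does what you need.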

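siteUrl <- "http://www.researchgate.net/"
GateUrl <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset="

library(rvest)
library(RSelenium)

# Download (if needed) and start the Selenium server, then open a browser session.
# Note: newer RSelenium releases made checkForServer() and startServer() defunct
# in favour of rsDriver().
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open(silent = FALSE)

i <- 0              # offset appended to the followers-list URL
profileUrls <- c()

for (j in 1:3) {
  print(j)
  remDrv$navigate(paste0(GateUrl, i))
  # Parse the rendered page source (read_html() supersedes the older html())
  l <- read_html(remDrv$getPageSource()[[1]])
  # Each follower name is a link with class "display-name"; its href is relative
  profileUrls <- c(profileUrls,
                   paste0(siteUrl, l %>% html_nodes(".display-name") %>% html_attr("href")))
  i <- length(profileUrls) + 1
}

remDrv$close()
profileUrls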

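A few notes. You need to work out the range of the j loop yourself. I think each URL returns 38 profiles, so j should be something like for(j in 1:(followers/38)).

Second, the code is not efficient in how it stores the links, because it appends to the vector on every iteration. A better solution would be to use lapply and unlist.

Finally, you need Mozilla Firefox on your machine because that is the browser RSelenium uses by default, although you can configure it to use whichever browser you prefer.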
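The first two notes can be sketched together as follows. This is only an illustration under stated assumptions: fetch_page is a hypothetical stand-in for the navigate-and-parse step of the loop above, the totals are made-up small numbers, and offsets are assumed to start at 0. Because the pages may overlap, the sketch also removes duplicates with unique().

```r
# Hypothetical stand-in for "fetch one page of followers at this offset".
# In practice this would wrap the navigate/parse steps of the loop above.
# It deliberately returns one overlapping item per page.
fetch_page <- function(offset) {
  paste0("profile/user_", offset + 0:3)
}

followers <- 10  # illustrative total number of followers
per_page  <- 3   # illustrative page size (an assumption, not a measured value)

# Round up so a final, partially filled page is not skipped
n_pages <- ceiling(followers / per_page)
offsets <- (seq_len(n_pages) - 1) * per_page

# Collect every page into a list, then flatten once instead of
# growing a vector inside the loop
pages <- lapply(offsets, fetch_page)
profile_urls <- unique(unlist(pages))
```

Building the list with lapply and flattening it once avoids re-copying the result vector on every iteration, which matters when the number of pages is in the thousands.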

Result for the first 56 links:
> profileUrls
[1] "http://www.researchgate.net/profile/Jose_Carbajo2"           
[2] "http://www.researchgate.net/profile/Daniele_Riccio"          
[3] "http://www.researchgate.net/profile/Fiona_Togneri2"          
[4] "http://www.researchgate.net/profile/Sukanya_Patel"           
[5] "http://www.researchgate.net/profile/Neri_Fattorini"          
[6] "http://www.researchgate.net/profile/Pham_Thi_Thuy_Van"       
[7] "http://www.researchgate.net/profile/Kestutis_Sasnauskas"     
[8] "http://www.researchgate.net/profile/Iris_Weintal"            
[9] "http://www.researchgate.net/profile/Godelieve_Verhaegen"     
[10] "http://www.researchgate.net/profile/Janani_Venkatraman2"     
[11] "http://www.researchgate.net/profile/Kai_Wang126"             
[12] "http://www.researchgate.net/profile/Irine_Ronin"             
[13] "http://www.researchgate.net/profile/Natasha_Ikhsan"          
[14] "http://www.researchgate.net/profile/Nadya_Hajar"             
[15] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"    
[16] "http://www.researchgate.net/profile/Amsha_Viraragavan"       
[17] "http://www.researchgate.net/profile/Wei_Leiyan"              
[18] "http://www.researchgate.net/profile/Yosuke_Inada"            
[19] "http://www.researchgate.net/profile/Nadya_Hajar"             
[20] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"    
[21] "http://www.researchgate.net/profile/Amsha_Viraragavan"       
[22] "http://www.researchgate.net/profile/Wei_Leiyan"              
[23] "http://www.researchgate.net/profile/Yosuke_Inada"            
[24] "http://www.researchgate.net/profile/Yongning_You"            
[25] "http://www.researchgate.net/profile/Susan_Hu6"               
[26] "http://www.researchgate.net/profile/Matt_Evans11"            
[27] "http://www.researchgate.net/profile/Nam_Kieu"                
[28] "http://www.researchgate.net/profile/Nur_Musa3"               
[29] "http://www.researchgate.net/profile/Varaporn_S"              
[30] "http://www.researchgate.net/profile/Askar_Begzat3"           
[31] "http://www.researchgate.net/profile/Bing_Wang63"             
[32] "http://www.researchgate.net/profile/Xuebin_Yan"              
[33] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez"
[34] "http://www.researchgate.net/profile/Stephen_Heimann"         
[35] "http://www.researchgate.net/profile/Hanina_Hanifa"           
[36] "http://www.researchgate.net/profile/Bo_Wang143"              
[37] "http://www.researchgate.net/profile/Xuebin_Yan"              
[38] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez"
[39] "http://www.researchgate.net/profile/Stephen_Heimann"         
[40] "http://www.researchgate.net/profile/Hanina_Hanifa"           
[41] "http://www.researchgate.net/profile/Bo_Wang143"              
[42] "http://www.researchgate.net/profile/Huili_Li5"               
[43] "http://www.researchgate.net/profile/Giuseppe_Infusini"       
[44] "http://www.researchgate.net/profile/Carmen_Wacher"           
[45] "http://www.researchgate.net/profile/Linyn_Linyn"             
[46] "http://www.researchgate.net/profile/Dan_Youel"               
[47] "http://www.researchgate.net/profile/Catherine_Williams16"    
[48] "http://www.researchgate.net/profile/Nichole_Macaraeg"        
[49] "http://www.researchgate.net/profile/Peter_Oroszlan"          
[50] "http://www.researchgate.net/profile/Eduard_Karamov"          
[51] "http://www.researchgate.net/profile/Mauricio_Franco3"        
[52] "http://www.researchgate.net/profile/Patricia_Zancan"         
[53] "http://www.researchgate.net/profile/Rohana_Dassanayake"      
[54] "http://www.researchgate.net/profile/Khadija_Khataby"         
[55] "http://www.researchgate.net/profile/Imane_Moest"             
[56] "http://www.researchgate.net/profile/Rory_Adey"

Answer 1 (score: 0)

As an alternative to RSelenium, you could try something like this (using the first 56 followers as an example):

library(XML)
library(jsonlite)
offsets <- seq(from = 1, to = 50, 18)
urls <- sprintf("http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset=%d", offsets)

df <- data.frame()
for (x in seq_along(urls)) {
  doc <- htmlParse(urls[x])
  script <- as(doc[['//script[5]']], "character")
  splits <- strsplit(script, '\\(function\\(\\)\\{Y\\.rg\\.createInitialWidget\\("[^\"]+",')[[1]][-1]
  res <- lapply(splits, function(split) {
    split <-sub(");})();\n", "", split, fixed = TRUE)
    res <- try(as.data.frame(t(unlist(fromJSON(gsub("\\\\", "", split))))), silent = TRUE)
    if (!inherits(res, "try-error")) return(res) else return(NULL)
  })
  df <- rbind(df, do.call(rbind, res[1:(length(res)-2)]))
}
dplyr::glimpse(df)
# Observations: 56
# Variables:
#   $ _isReact                                                         (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.displayName                                                 (fctr) Jose Maria Carbajo, Daniele Riccio, Fiona S Togneri, Sukanya Paramashivaiah Patel, Neri Fattorini, Pham thi thuy van, Kestutis Sasnauskas, Iris Weintal, Godelieve Verhaegen, Ja...
# $ data.profile.professionalInstitution.professionalInstitutionName (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o...
# $ data.profile.professionalInstitution.professionalInstitutionUrl  (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ...
# $ data.professionalInstitutionName                                 (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o...
# $ data.professionalInstitutionUrl                                  (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ...
# $ data.url                                                         (fctr) profile/Jose_Carbajo2, profile/Daniele_Riccio, profile/Fiona_Togneri2, profile/Sukanya_Patel, profile/Neri_Fattorini, profile/Pham_Thi_Thuy_Van, profile/Kestutis_Sasnauskas, pr...
# $ data.imageUrl                                                    (fctr) http://c1.rgstatic.net/m/797670414832/images/template/default/profile/profile_default_m.jpg, http://i1.rgstatic.net/i/profile/54a1a5539f8e2f289f_m_25d91.jpg, http://i1.rgstatic...
# $ data.imageSize                                                   (fctr) m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m
# $ data.imageHeight                                                 (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
# $ data.imageWidth                                                  (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
# $ data.enableFollowButton                                          (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.enableHideButton                                            (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.enableConnectionButton                                      (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.isClaimedAuthor                                             (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.hasExtraContainer                                           (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.showStatsWidgets                                            (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.showHideButton                                              (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.accountKey                                                  (fctr) Jose_Carbajo2, Daniele_Riccio, Fiona_Togneri2, Sukanya_Patel, Neri_Fattorini, Pham_Thi_Thuy_Van, Kestutis_Sasnauskas, Iris_Weintal, Godelieve_Verhaegen, Janani_Venkatraman2, Ka...
# $ data.hasInfoPopup                                                (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.hasTeaserPopup                                              (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.widgetId                                                    (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829...
# $ id                                                               (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829...
# $ templateName                                                     (fctr) application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, a...
# $ templateExtensions                                               (fctr) generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, ...
# $ widgetUrl                                                        (fctr) http://www.researchgate.net/application.PeopleAccountItem.html?entityId=7508014&imageSize=m&enableFollowButton=1&showHideButton=0&showConnectionButton=0&event=tp_followers_xflw...
# $ viewClass                                                        (fctr) views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views....
# $ yuiModules                                                       (fctr) rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleI...
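To get from this data frame to full profile links, the data.url column can be prefixed with the site address. A minimal sketch, assuming a stand-in df that holds only that column (the two values are taken from the output above) and that the relative paths resolve against http://www.researchgate.net/:

```r
# Stand-in for the `df` built above, reduced to the one column needed here
df <- data.frame(
  data.url = c("profile/Jose_Carbajo2", "profile/Daniele_Riccio"),
  stringsAsFactors = TRUE  # mimic the factor columns shown by glimpse()
)

site_url <- "http://www.researchgate.net/"

# Convert the factor to character before pasting, and drop any repeats
profile_urls <- unique(paste0(site_url, as.character(df$data.url)))
print(profile_urls)
```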

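Answer 2 (score: 0)

The server will return the data as JSON if you ask for it. Each subsequent call uses the offset parameter supplied by the previous JSON response. In the example below I have only called the first 10 offsets, which is equivalent to scrolling down 10 times. The response contains more data than just the profile links: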

library(RCurl)
library(XML)
library(jsonlite)
# get initial page
initURL <- "http://www.researchgate.net/topic/biotechnology"
doc <- htmlParse(initURL)
noFollowers <- doc["//*/strong/*/a[@class='js-see-all']", fun = xmlValue][[1]]
noFollowers <- as.integer(gsub("[^0-9]", "", noFollowers))

appURL <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000"
appData <- getURL(appURL
                  , httpheader = c(accept = "application/json"))
follData <- list(fromJSON(appData)$result$data$content$data$listItems)
for(i in 1:10){
  nextURL <- fromJSON(appData)$result$data$nextOffset
  appData <- getURL(paste0(appURL, "&offset=", nextURL)
                    , httpheader = c(accept = "application/json"))
  follData[[i+1]] <- fromJSON(appData)$result$data$content$data$listItems
}
followers <- na.omit(do.call(c, lapply(follData, function(x){x$data$url})))
> head(followers)
[1] "profile/Subhashish_Dutta" "profile/Jerome_Wang3"     "profile/Jose_Carbajo2"   
[4] "profile/Daniele_Riccio"   "profile/Fiona_Togneri2"   "profile/Sukanya_Patel"
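The loop above follows a fixed number of offsets. A possible generalisation is to keep following the returned offset until the server stops supplying one. The sketch below shows only that control flow, under stated assumptions: fake_pages and fetch_page are hypothetical stand-ins for the getURL/fromJSON request, and a missing nextOffset is assumed to mean there are no more pages, which has not been verified against the real service.

```r
# Hypothetical stand-in for the JSON responses: each page carries its items
# and the offset to request next (NULL when nothing is left).
fake_pages <- list(
  "0" = list(items = c("profile/a", "profile/b"), nextOffset = 2),
  "2" = list(items = c("profile/c", "profile/d"), nextOffset = 4),
  "4" = list(items = c("profile/e"), nextOffset = NULL)
)
fetch_page <- function(offset) fake_pages[[as.character(offset)]]

# Follow nextOffset from page to page, stopping at max_pages or when the
# response no longer provides an offset.
collect_followers <- function(fetch_page, max_pages = 10) {
  results <- vector("list", max_pages)
  offset <- 0
  for (i in seq_len(max_pages)) {
    page <- fetch_page(offset)
    results[i] <- list(page$items)
    if (is.null(page$nextOffset)) break
    offset <- page$nextOffset
  }
  unlist(results)  # unused (NULL) slots are dropped here
}

followers <- collect_followers(fetch_page)
```

The max_pages cap keeps a run bounded even if the service always returns another offset.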