I need to get the web links of all the followers listed on the following page.
https://www.researchgate.net/topic/biotechnology
At the moment this topic has 206770 followers. When I click the "see all" button, a pop-up window appears showing a list, and the list keeps extending as I scroll.
https://www.researchgate.net/profile/Kestutis_Sasnauskas ...
The above is the link for the top follower. Is there a way to get the web links of all 206770 followers?
Answer 0 (score: 1)
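This can be done with rvest and RSelenium. The latter is essentially required; the former will make your life easier. Install RSelenium from GitHub with devtools::install_github("ropensci/RSelenium"), and rvest from CRAN.
The following code does what you need.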
siteUrl <- "http://www.researchgate.net/"
GateUrl <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset="

library(rvest)
library(RSelenium)

# Start a local Selenium server and open a browser session.
# checkForServer() and startServer() come from older RSelenium releases.
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open(silent = FALSE)

i <- 0  # offset appended to the follower-list URL
profileUrls <- c()
for (j in 1:3) {
  print(j)
  remDrv$navigate(paste0(GateUrl, i))
  # Parse the rendered page and pull the href of every follower name.
  l <- read_html(remDrv$getPageSource()[[1]])
  profileUrls <- c(profileUrls,
                   paste0(siteUrl, l %>% html_nodes(".display-name") %>% html_attr("href")))
  i <- length(profileUrls) + 1
}
remDrv$close()
profileUrls
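A few things to note here. First, you need to work out the range of the j loop. I think each URL returns 38 profiles, so the loop should be something like for(j in 1:(followers/38)).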
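As a rough sketch of that loop bound, assuming the 38-per-page estimate holds (it is only a guess and should be checked against what the site actually returns), rounding up keeps the final, partially filled page. The variable names perPage and nPages are made up for illustration.

```r
followers <- 206770  # follower count quoted in the question
perPage <- 38        # assumed page size, taken from the estimate above

# 1:(followers / perPage) stops at the whole-number part of the ratio,
# so a final partial page would be skipped; ceiling() includes it.
nPages <- ceiling(followers / perPage)
nPages
# [1] 5442

for (j in seq_len(nPages)) {
  # fetch page j here
}
```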
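Second, the code is inefficient in how it stores the links, since it appends to the vector on every iteration. A better solution would be to use lapply and unlist.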
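A minimal sketch of the lapply and unlist idea, under stated assumptions: get_hrefs is a hypothetical function that returns the href values found on one page (in practice it would wrap the navigate and parse steps), and stub_hrefs is an offline stand-in with placeholder paths so the example runs without a browser.

```r
# Build the full list in one pass instead of growing a vector inside a loop.
# get_hrefs is a hypothetical function(offset) returning the hrefs on one page.
collect_profile_urls <- function(offsets, get_hrefs,
                                 site_url = "http://www.researchgate.net/") {
  pages <- lapply(offsets, function(offset) paste0(site_url, get_hrefs(offset)))
  unlist(pages, use.names = FALSE)
}

# Offline stand-in for the scraping step; the paths are placeholders, not real profiles.
stub_hrefs <- function(offset) sprintf("profile/Example_%d", offset + 0:1)

urls <- collect_profile_urls(c(1, 3, 5), stub_hrefs)
length(urls)
```

Because each page is collected into its own list element, the next offset can no longer be derived from the length of a growing vector; the offsets have to be known up front, as they are here.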
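Finally, you need Mozilla Firefox on your machine, since that is the default browser RSelenium uses, although you can configure it to use whichever browser you prefer.
Results for the first 56:
> profileUrls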
[1] "http://www.researchgate.net/profile/Jose_Carbajo2"
[2] "http://www.researchgate.net/profile/Daniele_Riccio"
[3] "http://www.researchgate.net/profile/Fiona_Togneri2"
[4] "http://www.researchgate.net/profile/Sukanya_Patel"
[5] "http://www.researchgate.net/profile/Neri_Fattorini"
[6] "http://www.researchgate.net/profile/Pham_Thi_Thuy_Van"
[7] "http://www.researchgate.net/profile/Kestutis_Sasnauskas"
[8] "http://www.researchgate.net/profile/Iris_Weintal"
[9] "http://www.researchgate.net/profile/Godelieve_Verhaegen"
[10] "http://www.researchgate.net/profile/Janani_Venkatraman2"
[11] "http://www.researchgate.net/profile/Kai_Wang126"
[12] "http://www.researchgate.net/profile/Irine_Ronin"
[13] "http://www.researchgate.net/profile/Natasha_Ikhsan"
[14] "http://www.researchgate.net/profile/Nadya_Hajar"
[15] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"
[16] "http://www.researchgate.net/profile/Amsha_Viraragavan"
[17] "http://www.researchgate.net/profile/Wei_Leiyan"
[18] "http://www.researchgate.net/profile/Yosuke_Inada"
[19] "http://www.researchgate.net/profile/Nadya_Hajar"
[20] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"
[21] "http://www.researchgate.net/profile/Amsha_Viraragavan"
[22] "http://www.researchgate.net/profile/Wei_Leiyan"
[23] "http://www.researchgate.net/profile/Yosuke_Inada"
[24] "http://www.researchgate.net/profile/Yongning_You"
[25] "http://www.researchgate.net/profile/Susan_Hu6"
[26] "http://www.researchgate.net/profile/Matt_Evans11"
[27] "http://www.researchgate.net/profile/Nam_Kieu"
[28] "http://www.researchgate.net/profile/Nur_Musa3"
[29] "http://www.researchgate.net/profile/Varaporn_S"
[30] "http://www.researchgate.net/profile/Askar_Begzat3"
[31] "http://www.researchgate.net/profile/Bing_Wang63"
[32] "http://www.researchgate.net/profile/Xuebin_Yan"
[33] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez"
[34] "http://www.researchgate.net/profile/Stephen_Heimann"
[35] "http://www.researchgate.net/profile/Hanina_Hanifa"
[36] "http://www.researchgate.net/profile/Bo_Wang143"
[37] "http://www.researchgate.net/profile/Xuebin_Yan"
[38] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez"
[39] "http://www.researchgate.net/profile/Stephen_Heimann"
[40] "http://www.researchgate.net/profile/Hanina_Hanifa"
[41] "http://www.researchgate.net/profile/Bo_Wang143"
[42] "http://www.researchgate.net/profile/Huili_Li5"
[43] "http://www.researchgate.net/profile/Giuseppe_Infusini"
[44] "http://www.researchgate.net/profile/Carmen_Wacher"
[45] "http://www.researchgate.net/profile/Linyn_Linyn"
[46] "http://www.researchgate.net/profile/Dan_Youel"
[47] "http://www.researchgate.net/profile/Catherine_Williams16"
[48] "http://www.researchgate.net/profile/Nichole_Macaraeg"
[49] "http://www.researchgate.net/profile/Peter_Oroszlan"
[50] "http://www.researchgate.net/profile/Eduard_Karamov"
[51] "http://www.researchgate.net/profile/Mauricio_Franco3"
[52] "http://www.researchgate.net/profile/Patricia_Zancan"
[53] "http://www.researchgate.net/profile/Rohana_Dassanayake"
[54] "http://www.researchgate.net/profile/Khadija_Khataby"
[55] "http://www.researchgate.net/profile/Imane_Moest"
[56] "http://www.researchgate.net/profile/Rory_Adey"
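Note that the output above contains repeats: entries [19] to [23] duplicate [14] to [18], and [37] to [41] duplicate [32] to [36], which suggests the pages overlap at the offsets used. A small sketch of removing such repeats after collection, using a few of the values from the output above (the name uniqueUrls is made up for illustration):

```r
# A short excerpt of the collected links, including two repeated entries.
profileUrls <- c(
  "http://www.researchgate.net/profile/Wei_Leiyan",
  "http://www.researchgate.net/profile/Yosuke_Inada",
  "http://www.researchgate.net/profile/Wei_Leiyan",
  "http://www.researchgate.net/profile/Yosuke_Inada",
  "http://www.researchgate.net/profile/Yongning_You"
)

sum(duplicated(profileUrls))           # number of repeated entries
uniqueUrls <- unique(profileUrls)      # keeps the first occurrence of each link
```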
Answer 1 (score: 0)
As an alternative to RSelenium, you could try something like this (using the first 56 followers as an example):
library(XML)
library(jsonlite)

# Offsets 1, 19 and 37, one request per offset.
offsets <- seq(from = 1, to = 50, by = 18)
urls <- sprintf("http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset=%d", offsets)

df <- data.frame()
for (x in seq_along(urls)) {
  doc <- htmlParse(urls[x])
  # Take the fifth script element of the page as text.
  script <- as(doc[['//script[5]']], "character")
  # Split it into one chunk per createInitialWidget(...) call.
  splits <- strsplit(script, '\\(function\\(\\)\\{Y\\.rg\\.createInitialWidget\\("[^\"]+",')[[1]][-1]
  res <- lapply(splits, function(split) {
    # Strip the trailing JavaScript and the backslashes, then parse the JSON.
    split <- sub(");})();\n", "", split, fixed = TRUE)
    res <- try(as.data.frame(t(unlist(fromJSON(gsub("\\\\", "", split))))), silent = TRUE)
    if (!inherits(res, "try-error")) return(res) else return(NULL)
  })
  # Drop the last two chunks before binding the rows.
  df <- rbind(df, do.call(rbind, res[1:(length(res)-2)]))
}
dplyr::glimpse(df)
# Observations: 56
# Variables:
# $ _isReact (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.displayName (fctr) Jose Maria Carbajo, Daniele Riccio, Fiona S Togneri, Sukanya Paramashivaiah Patel, Neri Fattorini, Pham thi thuy van, Kestutis Sasnauskas, Iris Weintal, Godelieve Verhaegen, Ja...
# $ data.profile.professionalInstitution.professionalInstitutionName (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o...
# $ data.profile.professionalInstitution.professionalInstitutionUrl (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ...
# $ data.professionalInstitutionName (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o...
# $ data.professionalInstitutionUrl (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ...
# $ data.url (fctr) profile/Jose_Carbajo2, profile/Daniele_Riccio, profile/Fiona_Togneri2, profile/Sukanya_Patel, profile/Neri_Fattorini, profile/Pham_Thi_Thuy_Van, profile/Kestutis_Sasnauskas, pr...
# $ data.imageUrl (fctr) http://c1.rgstatic.net/m/797670414832/images/template/default/profile/profile_default_m.jpg, http://i1.rgstatic.net/i/profile/54a1a5539f8e2f289f_m_25d91.jpg, http://i1.rgstatic...
# $ data.imageSize (fctr) m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m
# $ data.imageHeight (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
# $ data.imageWidth (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
# $ data.enableFollowButton (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.enableHideButton (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.enableConnectionButton (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.isClaimedAuthor (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.hasExtraContainer (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.showStatsWidgets (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.showHideButton (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.accountKey (fctr) Jose_Carbajo2, Daniele_Riccio, Fiona_Togneri2, Sukanya_Patel, Neri_Fattorini, Pham_Thi_Thuy_Van, Kestutis_Sasnauskas, Iris_Weintal, Godelieve_Verhaegen, Janani_Venkatraman2, Ka...
# $ data.hasInfoPopup (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.hasTeaserPopup (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.widgetId (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829...
# $ id (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829...
# $ templateName (fctr) application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, a...
# $ templateExtensions (fctr) generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, ...
# $ widgetUrl (fctr) http://www.researchgate.net/application.PeopleAccountItem.html?entityId=7508014&imageSize=m&enableFollowButton=1&showHideButton=0&showConnectionButton=0&event=tp_followers_xflw...
# $ viewClass (fctr) views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views....
# $ yuiModules (fctr) rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleI...
Answer 2 (score: 0)
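The server will return the data as JSON if you ask for it. Each subsequent call uses the offset parameter supplied by the previous JSON response. In the example below I have only requested the first 10 offsets, which is equivalent to scrolling down 10 times. The response contains more data than just the profile links: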
library(RCurl)
library(XML)
library(jsonlite)

# Get the initial page and read the follower count from the js-see-all link.
initURL <- "http://www.researchgate.net/topic/biotechnology"
doc <- htmlParse(initURL)
noFollowers <- doc["//*/strong/*/a[@class='js-see-all']", fun = xmlValue][[1]]
noFollowers <- as.integer(gsub("[^0-9]", "", noFollowers))

appURL <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000"
# Request the follower list as JSON.
appData <- getURL(appURL, httpheader = c(accept = "application/json"))
follData <- list(fromJSON(appData)$result$data$content$data$listItems)
for (i in 1:10) {
  # Each response supplies the offset for the next request.
  nextURL <- fromJSON(appData)$result$data$nextOffset
  appData <- getURL(paste0(appURL, "&offset=", nextURL),
                    httpheader = c(accept = "application/json"))
  follData[[i + 1]] <- fromJSON(appData)$result$data$content$data$listItems
}
# Collect the profile paths from every page.
followers <- na.omit(do.call(c, lapply(follData, function(x) { x$data$url })))
> head(followers)
[1] "profile/Subhashish_Dutta" "profile/Jerome_Wang3" "profile/Jose_Carbajo2"
[4] "profile/Daniele_Riccio" "profile/Fiona_Togneri2" "profile/Sukanya_Patel"
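To go beyond the first 10 offsets, the same idea can be wrapped in a loop that keeps following nextOffset. The sketch below is only an illustration under stated assumptions: collect_all and fetch_page are hypothetical names, fetch_page stands for a function that performs the getURL and fromJSON steps and returns list(items = ..., nextOffset = ...), and it is an assumption that the server stops supplying nextOffset on the last page. fake_fetch is an offline stand-in with placeholder paths so the example runs without network access.

```r
# Follow nextOffset until none is returned or the page cap is reached.
collect_all <- function(fetch_page, max_pages = Inf) {
  pages <- list()
  offset <- NULL
  n <- 0
  repeat {
    page <- fetch_page(offset)
    n <- n + 1
    pages[[n]] <- page$items
    offset <- page$nextOffset
    if (is.null(offset) || n >= max_pages) break
  }
  unlist(pages, use.names = FALSE)
}

# Offline stand-in for the HTTP call: three pages of two placeholder paths each.
fake_fetch <- function(offset) {
  start <- if (is.null(offset)) 1 else offset
  items <- sprintf("profile/Example_%d", start:(start + 1))
  list(items = items, nextOffset = if (start < 5) start + 2 else NULL)
}

allUrls <- collect_all(fake_fetch)
length(allUrls)
```

The max_pages argument is there so a test run can stop after a handful of requests instead of walking through every follower.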