I'm new to R. I'm trying to scrape multiple https web pages using the getURIAsynchronous() function from the RCurl package. However, for every URL the function returns "" as the result.
I tried url.exists() from the same package to see whether it returns TRUE or FALSE. To my surprise, it returned FALSE, even though the URLs do exist.
Since these https URLs are specific to my company, I can't share examples for confidentiality reasons. However, readLines() successfully extracts all of the HTML content from the sites; it's just slow and time-consuming for thousands of URLs. Any idea why getURIAsynchronous() returns "" instead of the HTML content? My goal is only to download the entire HTML; I can parse the data myself.

Is there another package that can help me scrape multiple https sites faster, rather than one page at a time?
Update: below is a small example similar to what I've been trying to do. Here it's only a couple of URLs, but in my project I have several thousand. When I try to extract text with code like this, I get "" for every URL.
library(RCurl)

source_url <- c("https://cran.r-project.org/web/packages/RCurl/index.html",
                "https://cran.r-project.org/web/packages/rvest/index.html")

multi_urls <- getURIAsynchronous(source_url)
multi_urls <- as.list(multi_urls)
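A common cause of "" results (and FALSE from url.exists()) on https URLs is that RCurl cannot verify the site's SSL certificate, particularly on Windows setups with no CA bundle configured. As a diagnostic, you can pass curl options through `.opts` — a sketch, not a confirmed fix for your environment:

```r
library(RCurl)

source_url <- c("https://cran.r-project.org/web/packages/RCurl/index.html",
                "https://cran.r-project.org/web/packages/rvest/index.html")

# Diagnostic only: disable peer verification to see whether SSL is the problem.
# For production use, point RCurl at a CA bundle via the cainfo option instead.
multi_urls <- getURIAsynchronous(source_url,
                                 .opts = list(ssl.verifypeer = FALSE,
                                              followlocation = TRUE))
nchar(multi_urls)  # non-zero lengths mean the pages were actually fetched
```

If this returns non-empty strings, the problem is certificate verification rather than the asynchronous fetching itself.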
Answer (score: 1)
I don't know which specific URLs you're trying to scrape, but the code below demonstrates how to loop through several URLs and scrape data from each one. Perhaps you can adapt it to your specific goal.
library(rvest)
library(stringr)
#create a master dataframe to store all of the results
complete <- data.frame()
yearsVector <- c("2010", "2011", "2012", "2013", "2014", "2015")
#position is not needed since all of the info is stored on the page
#positionVector <- c("qb", "rb", "wr", "te", "ol", "dl", "lb", "cb", "s")
positionVector <- c("qb")
for (i in 1:length(yearsVector)) {
    for (j in 1:length(positionVector)) {
        # create a url template
        URL.base <- "http://www.nfl.com/draft/"
        URL.intermediate <- "/tracker?icampaign=draft-sub_nav_bar-drafteventpage-tracker#dt-tabs:dt-by-position/dt-by-position-input:"
        # build the url from the dynamic values
        URL <- paste0(URL.base, yearsVector[i], URL.intermediate, positionVector[j])
        # print(URL)
        # read the page - store the page to make debugging easier
        page <- read_html(URL)
        # find the record for each player
        playersloc <- str_locate_all(page, "\\{\"personId.*?\\}")[[1]]
        # strip the braces: [, 1] is the match start, [, 2] is the match end
        players <- str_sub(page, playersloc[, 1] + 1, playersloc[, 2] - 1)
        # fix the cases where the players are named Jr.
        players <- gsub(", ", "_", players)
        # split and reshape the data into a data frame
        play2 <- strsplit(gsub("\"", "", players), ',')
        data <- sapply(strsplit(unlist(play2), ":"), FUN = function(x) { x[2] })
        df <- data.frame(matrix(data, ncol = 16, byrow = TRUE))
        # set the column names
        names(df) <- sapply(strsplit(unlist(play2[1]), ":"), FUN = function(x) { x[1] })
        # append the temp values to the master dataframe
        complete <- rbind(complete, df)
    }
}
Also . . .
library(rvest)
library(stringr)
library(tidyr)
site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0'
webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep = "")

dfList <- lapply(site, function(i) {
    webpage <- read_html(i)
    draft_table <- html_nodes(webpage, 'table')
    draft <- html_table(draft_table)[[1]]
})
finaldf <- do.call(rbind, dfList)
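The original question also asked for a faster way to fetch thousands of https pages than one at a time. One option (my suggestion, not part of the answer above) is the multi interface of the curl package, which performs many downloads concurrently; a minimal sketch:

```r
library(curl)

urls <- c("https://cran.r-project.org/web/packages/RCurl/index.html",
          "https://cran.r-project.org/web/packages/rvest/index.html")

results <- list()
for (u in urls) {
    # local() captures the current url so each callback stores under the right key
    local({
        this_url <- u
        multi_add(new_handle(url = this_url),
                  done = function(res) {
                      # res$content is a raw vector of the response body
                      results[[this_url]] <<- rawToChar(res$content)
                  },
                  fail = function(msg) {
                      results[[this_url]] <<- paste("failed:", msg)
                  })
    })
}
multi_run()  # runs all queued downloads concurrently
lengths(results)
```

Unlike fetching in a plain loop, the requests here overlap in flight, which is usually the main bottleneck when downloading thousands of pages.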