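I am querying Freebase to get genre information for about 10,000 movies.
After reading "How to optimise scraping with getURL() in R", I tried to run the requests in parallel. However, I failed (see below). Besides parallelization, I have also read that httr might be a better alternative to RCurl.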
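My questions are:
Is it possible to speed up the API calls by using
a) a parallel version of the loop below (on a WINDOWS machine)?
b) alternatives to getURL, such as GET in the httr package?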
library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)
df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA)
f_query_freebase <- function(film.title){
  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
                    "&indent=TRUE",
                    "&limit=1",
                    "&output=(/film/film/genre)")
  temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector=FALSE)
  genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse=" | ")
  return(genre)
}
# Non-parallel version
# ----------------------------------
for (i in df$film){
  df$genre[which(df$film==i)] <- f_query_freebase(i)
}
# Parallel version - Does not work
# ----------------------------------
# Set up parallel computing
cl <- makeCluster(2)
registerDoSNOW(cl)
foreach(i=df$film) %dopar% {
  df$genre[which(df$film==i)] <- f_query_freebase(i)
}
stopCluster(cl)
# --> I get the following error: "Error in { : task 1 failed", further saying that it cannot find the function "getURL".
Answer 0 (score: 1)
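This does not achieve parallel requests within a single R session. However, I have used it to make more than one simultaneous request (i.e. in parallel) across multiple R sessions, so it may be useful.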
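As an aside on the in-session attempt from the question: snow workers start as fresh R sessions, without the packages that were attached in the main session, which is consistent with the reported error about not finding getURL. Below is a minimal sketch of one possible fix, assuming foreach's .packages argument is used to load packages on the workers. f_query_stub is a hypothetical offline stand-in for f_query_freebase, and jsonlite stands in for RCurl so that the example runs without network access. Collecting the return values also avoids assigning to df inside the loop, since each worker would only modify its own copy.

```r
library(foreach)
library(doSNOW)

# Hypothetical stand-in for f_query_freebase(): it relies on jsonlite being
# attached in whichever R session evaluates it, just as the original
# function relies on RCurl for getURL().
f_query_stub <- function(film.title) {
  json <- toJSON(list(name = film.title), auto_unbox = TRUE)
  fromJSON(json)$name
}

films <- c("Terminator", "Die Hard", "Platoon")

cl <- makeCluster(2)
registerDoSNOW(cl)

# .packages loads jsonlite on every worker before the loop body runs;
# .combine = c collects the per-film results into one character vector.
genres <- foreach(i = films, .combine = c, .packages = "jsonlite") %dopar% {
  f_query_stub(i)
}

stopCluster(cl)
print(genres)
```

With the real function, the argument would presumably be .packages = c("RCurl", "jsonlite"), and the returned vector could then be assigned to df$genre in the main session.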
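You will need to split the process into a few parts:
Note: this happened to run on Windows, so I used PowerShell. On a Mac, this could be written in bash.
Use a single PowerShell script to start multiple R processes (here we divide the work among 3 processes):
For example, save a plain-text file with the .ps1 file extension. You can double-click it to run it, or schedule it with Task Scheduler/cron: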
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }
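What is it doing? Each line starts a new PowerShell process, changes to the Desktop directory, and runs the script called extract.R found there, passing one argument to the R script (1, 2, or 3).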
Each R process runs a script that looks like this:
# Get command line argument
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])
api_calls <- read.csv("api_calls.csv")
# Work out which API calls this process should make:
# every 3rd row, starting at row number process_number
indices <- seq(process_number, nrow(api_calls), 3)
api_calls_for_this_process_only <- api_calls[indices, ] # this subsets for 1/3 of the API calls
# (the other two processes will take care of the remaining calls)
# Now, make API calls as usual using rvest/jsonlite or whatever you use for that
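
To see why the seq() call above splits the work cleanly, the following sketch evaluates the same striding rule for all three process numbers and checks that every row ends up with exactly one process. The api_calls data frame here is a hypothetical 10-row stand-in for the contents of api_calls.csv, and n_processes is an illustrative name for the hard-coded 3.

```r
# Hypothetical stand-in for api_calls.csv (10 pending API calls).
api_calls <- data.frame(film = paste("film", 1:10))
n_processes <- 3

# Same striding rule as in extract.R, evaluated for each process number.
indices_per_process <- lapply(
  seq_len(n_processes),
  function(process_number) seq(process_number, nrow(api_calls), n_processes)
)
print(indices_per_process)

# Each row is assigned to exactly one process: no duplicates, nothing missed.
all_indices <- unlist(indices_per_process)
stopifnot(anyDuplicated(all_indices) == 0)
stopifnot(setequal(all_indices, seq_len(nrow(api_calls))))
```

If the number of processes changes, the third argument of seq() in extract.R and the number of lines in the PowerShell script have to change together.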