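I am using JD Long's segue package (https://code.google.com/p/segue/) for parallel computing, and I am having trouble getting CRAN packages to load on the EC2 instances.
First, I create an EMR cluster as follows: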
myCluster <- createCluster(numInstances = 5,
                           cranPackages = c("RWeka", "tm"),
                           masterInstanceType = "m1.large",
                           slaveInstanceType = "m1.large",
                           location = "us-east-1c")
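As the documentation describes, I specify the packages to load (in this case, RWeka and tm).
The cluster appears to start normally, with no error messages. I am using RStudio and R version 3.0.2 on Linux Mint 17.
I wrote a function, getTerms.jobAd, that takes a string and calls some functions from the packages above, and I run it with emrlapply() as follows: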
> jobAdTerms <- emrlapply(myCluster, X = as.list(jobAds[1:2, 3]), FUN = getTerms.jobAd)
RUNNING - 2014-06-24 17:05:19
RUNNING - 2014-06-24 17:05:50
WAITING - 2014-06-24 17:06:20
When I inspect the jobAdTerms list that should have been returned, I get:
> jobAdTerms
[[1]]
[1] "error caught by Segue: Error in function (txt) : could not find function \"Corpus\"\n"
[[2]]
[1] "error caught by Segue: Error in function (txt) : could not find function \"Corpus\"\n"
Corpus is, of course, one of the functions in the tm package.
What am I doing wrong, and how can I fix this? Thanks!
Edit
Here is the function I am calling:
nGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 4))
getTerms.jobAd <- function(txt) {
  tmp <- tolower(txt)
  tmp <- gsub('\\s*<.*?>|[:;,@$%^&*()?]|(?<=[a-zA-Z])\\.(?= |$)', '', tmp, perl = TRUE)
  txt.Corpus <- Corpus(VectorSource(tmp))
  txt.Corpus <- tm_map(txt.Corpus, stripWhitespace)
  txt.TFV <- termFreq(txt.Corpus[[1]], control = list(dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
  txt.TFV2 <- termFreq(txt.Corpus[[1]], control = list(tokenize = nGramTokenizer, dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
  jobTerms <- rowSums(as.matrix(c(txt.TFV, txt.TFV2)))
  return(jobTerms)
}
Edit 2
Here is how to reproduce the error:
data(crude)
jobAdTerms <- emrlapply(myCluster, X = as.list(crude), FUN = getTerms.jobAd)
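One possible explanation, stated as an assumption rather than a confirmed diagnosis: cranPackages may install tm and RWeka on the nodes without attaching them in the R session that evaluates FUN, in which case unqualified names such as Corpus would not be found. The sketch below uses a hypothetical helper, withPackages (not part of segue), that attaches the named packages before calling the wrapped function. The local demonstration uses the tools package, which ships with R but is not attached in a default session, to mimic an installed-but-not-attached package.

```r
# Hypothetical helper (not part of segue): returns a wrapper that attaches
# the given packages on whichever machine evaluates it, then calls FUN.
withPackages <- function(FUN, packages) {
  force(FUN)
  force(packages)
  function(...) {
    for (pkg in packages) {
      # library() raises an error if the package is not installed,
      # so a missing install would surface clearly here.
      library(pkg, character.only = TRUE)
    }
    FUN(...)
  }
}

# Local demonstration. file_ext() lives in 'tools', which is installed with R
# but normally not on the search path of a fresh session.
ext <- function(path) file_ext(path)

# Without attaching 'tools', the unqualified call is expected to fail with a
# "could not find function" error; capture the message instead of stopping.
before <- tryCatch(ext("report.txt"), error = function(e) conditionMessage(e))
print(before)

# With the wrapper, 'tools' is attached first and the same call succeeds.
ext_with_tools <- withPackages(ext, "tools")
print(ext_with_tools("report.txt"))
```

If the assumption holds, the same wrapper could be passed to emrlapply, for example FUN = withPackages(getTerms.jobAd, c("tm", "RWeka")); this has not been tested on a cluster. Objects defined outside the function, such as jobTags and nGramTokenizer, are a separate matter that this sketch does not address.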