Loading CRAN packages for use with emrlapply() from JD Long's 'segue' package?

Asked: 2014-06-25 00:16:53

Tags: r amazon-ec2 segue emr

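I am using JD Long's segue package (https://code.google.com/p/segue/) for parallel computation, and I have run into a problem loading CRAN packages on the EC2 instances.

First, I created an EMR cluster as follows: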

myCluster <- createCluster(numInstances = 5, 
                           cranPackages = c("RWeka", "tm"), 
                           masterInstanceType="m1.large", 
                           slaveInstanceType="m1.large", 
                           location="us-east-1c")

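Following the documentation, I specified the packages to be loaded through cranPackages (in this case RWeka and tm).

The cluster appears to start normally, with no error messages. I am using RStudio and R version 3.0.2 on Linux Mint 17.

I wrote a function, getTerms.jobAd, that takes a character string and calls some functions from the packages above, and I run it with emrlapply() as follows: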

> jobAdTerms <- emrlapply(myCluster, X = as.list(jobAds[1:2, 3]), FUN = getTerms.jobAd)
RUNNING - 2014-06-24 17:05:19
RUNNING - 2014-06-24 17:05:50
WAITING - 2014-06-24 17:06:20

When I inspect jobAdTerms, the list that should have been returned, I get:

> jobAdTerms
[[1]]
[1] "error caught by Segue: Error in function (txt)  : could not find function \"Corpus\"\n"

[[2]]
[1] "error caught by Segue: Error in function (txt)  : could not find function \"Corpus\"\n"

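Corpus is, of course, one of the functions in the tm package, so it looks as if tm is not available to the function when it runs on the cluster.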
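To show the kind of failure I mean, the same message can be reproduced locally without any cluster. This is only a sketch of the mechanism, under the assumption (which I have not confirmed for segue) that the workers install the packages listed in cranPackages but do not attach them. The base package tools stands in for tm here, and the helpers unqualified_ext, qualified_ext and run_safely are hypothetical names made up for this illustration.

```r
# Local illustration only; no cluster and no segue calls are involved.
# 'tools' ships with R but is normally not attached, so it stands in for tm.
# Detach it if the current session happens to have it attached.
if ("package:tools" %in% search()) detach("package:tools")

# Hypothetical helper: only works if the package was attached with library().
unqualified_ext <- function(path) file_ext(path)

# Hypothetical helper: names the package explicitly, so no attach is needed.
qualified_ext <- function(path) tools::file_ext(path)

# Capture errors as strings, loosely mimicking how the results above came back.
run_safely <- function(f, x) {
  lapply(x, function(el) tryCatch(f(el), error = function(e) conditionMessage(e)))
}

files <- list("resume.txt", "posting.html")
res_unqualified <- run_safely(unqualified_ext, files)
res_qualified <- run_safely(qualified_ext, files)

print(res_unqualified)  # each element holds a "could not find function" message
print(res_qualified)    # each element holds the file extension
```

If the same thing is happening on the workers, calling library(tm) and library(RWeka) at the top of getTerms.jobAd, or writing tm::Corpus explicitly, might be worth trying. I have not verified either against segue.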

What am I doing wrong, and how can I fix it? Thanks!

Edit

Here is the function I am calling:

nGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 4))

getTerms.jobAd <- function(txt) {

  tmp <- tolower(txt)
  tmp <- gsub('\\s*<.*?>|[:;,@$%^&*()?]|(?<=[a-zA-Z])\\.(?= |$)', '', tmp, perl = TRUE)

  txt.Corpus <- Corpus(VectorSource(tmp))
  txt.Corpus <- tm_map(txt.Corpus, stripWhitespace)

  txt.TFV <- termFreq(txt.Corpus[[1]], control = list(dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
  txt.TFV2 <- termFreq(txt.Corpus[[1]], control = list(tokenize = nGramTokenizer, dictionary = jobTags[, 1], wordLengths = c(1, Inf)))

  jobTerms <- rowSums(as.matrix(c(txt.TFV, txt.TFV2)))
  return(jobTerms)
}

Edit 2

Here is how to reproduce the error:

library(tm)  # the crude example corpus ships with tm
data(crude)
jobAdTerms <- emrlapply(myCluster, X = as.list(crude), FUN = getTerms.jobAd)

0 answers:

No answers yet.