l_ply: confusion about how to pass variables to the function

Asked: 2015-03-29 16:09:52

Tags: r plyr

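I'm sure this must have been answered somewhere, so if you have a pointer to an answer, please let me know... ;o)

I have a number of fairly large processing tasks (mostly multi-label text classifiers) that read in a large number of files, do something with each one, output a result and then move on to the next.

I have this working neatly in sequence but want to parallelise it.

By way of a very basic example...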

require(plyr)
fileDir   <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
# list the CSV files in fileDir (full paths), not in the current working directory
files <- as.list(list.files(path=fileDir, full.names=TRUE, recursive=FALSE, pattern="\\.csv$"))

l_ply(files, function(x){
  print(x)

  # change to dir containing source files
  setwd(fileDir)

  # read file
  content <- read.csv(file=x, header=TRUE)

  # change directory to output
  setwd(outputDir)

  # append the itemID column from the CSV file to ids.csv
  write.table(content$itemID, file="ids.csv", append=TRUE, sep=",", row.names=FALSE, col.names=TRUE)

}, .parallel=FALSE )

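This loops over all the files in the directory fileDir, opens each CSV, extracts a value from the file and appends it to an output CSV saved in the directory outputDir. It is a basic example, but it runs fine and is enough to illustrate the problem.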
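For illustration, the same sequential loop can be sketched without any setwd() calls by building full paths with file.path(). This is only a minimal sketch, not the original code: the temporary directories and the two sample CSV files are made up for the demonstration, and writing the header only once is an assumption about the desired output.

```r
library(plyr)

# Hypothetical demo directories under tempdir(), standing in for fileDir/outputDir
fileDir   <- file.path(tempdir(), "sourceFiles")
outputDir <- file.path(tempdir(), "outputFiles")
dir.create(fileDir, showWarnings = FALSE)
dir.create(outputDir, showWarnings = FALSE)

# Two small made-up input files, each with an itemID column
write.csv(data.frame(itemID = 1:3), file.path(fileDir, "a.csv"), row.names = FALSE)
write.csv(data.frame(itemID = 4:5), file.path(fileDir, "b.csv"), row.names = FALSE)

files   <- list.files(fileDir, pattern = "\\.csv$", full.names = TRUE)
outFile <- file.path(outputDir, "ids.csv")
if (file.exists(outFile)) file.remove(outFile)  # start from a clean output file

l_ply(files, function(x) {
  content <- read.csv(x, header = TRUE)
  # Write the header only when the output file does not exist yet
  firstWrite <- !file.exists(outFile)
  write.table(content["itemID"], file = outFile, append = !firstWrite,
              sep = ",", row.names = FALSE, col.names = firstWrite)
})

# Read the combined IDs back in
ids <- read.csv(outFile)$itemID
```

Because every path is absolute, the function no longer depends on the working directory of the process that runs it.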

Running this loop in parallel causes a problem, because the directory variables (fileDir & outputDir) are essentially unknown to the anonymous function(x), a la...

require(plyr)
require(doParallel)
fileDir   <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
# list the CSV files in fileDir (full paths), not in the current working directory
files <- as.list(list.files(path=fileDir, full.names=TRUE, recursive=FALSE, pattern="\\.csv$"))

cl <- makeCluster(4)    # start a cluster of 4 worker processes
registerDoParallel(cl)  # register it as the foreach backend used by plyr

l_ply(files, function(x){
  print(x)

  # change to dir containing source files
  #setwd(fileDir)

  # read file
  content <- read.csv(file=x, header=TRUE)

  # change directory to output
  # (y is not defined inside this function, which is the problem being asked about)
  setwd(y)

  # append the itemID column from the CSV file to ids.csv
  write.table(content$itemID, file="ids.csv", append=TRUE, sep=",", row.names=FALSE, col.names=TRUE)

}, .parallel=TRUE )

stopCluster(cl)  # kill the cluster

Can anyone show me how to pass these two directory variables to the function?

1 answer:

Answer 0 (score: 0)

Thanks to @Roland, my parallel function now looks like this...

require(plyr)
require(doParallel)
fileDir   <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
# list the CSV files in fileDir (full paths), not in the current working directory
files <- as.list(list.files(path=fileDir, full.names=TRUE, recursive=FALSE, pattern="\\.csv$"))

cl <- makeCluster(4)    # start a cluster of 4 worker processes
registerDoParallel(cl)  # register it as the foreach backend used by plyr

# y and z are supplied as extra named arguments to l_ply (see the end of the call),
# which forwards them to the function on every worker
l_ply(files, function(x,y,z){
  filename  <- x
  fileDir   <- y
  outputDir <- z

  # change to dir containing source files
  setwd(fileDir)

  # read file
  content <- read.csv(file=filename, header=TRUE)

  # change directory to output
  setwd(outputDir)

  # append the itemID column from the CSV file to ids.csv
  write.table(content$itemID, file="ids.csv", append=TRUE, sep=",", row.names=FALSE, col.names=TRUE)

}, y=fileDir, z=outputDir, .parallel=TRUE )

stopCluster(cl)  # kill the cluster
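One thing to watch with the code above: when it runs in parallel, several workers may append to the same ids.csv at the same moment, so lines could interleave. A possible variation, sketched here under assumptions, keeps the idea of passing the directory as an explicit extra argument but lets each worker only read and return its IDs, with the parent process writing the file once. The helper name readIds, its srcDir argument, the temporary directories and the sample files are all hypothetical, and llply is swapped in for l_ply so that results are returned.

```r
library(plyr)
library(doParallel)

# Hypothetical demo directories and files (not from the original post)
fileDir   <- file.path(tempdir(), "sourceFilesPar")
outputDir <- file.path(tempdir(), "outputFilesPar")
dir.create(fileDir, showWarnings = FALSE)
dir.create(outputDir, showWarnings = FALSE)
write.csv(data.frame(itemID = 1:3), file.path(fileDir, "a.csv"), row.names = FALSE)
write.csv(data.frame(itemID = 4:5), file.path(fileDir, "b.csv"), row.names = FALSE)

files <- list.files(fileDir, pattern = "\\.csv$")  # bare file names

# Hypothetical worker: receives the source directory as an explicit argument
# and returns the itemID column instead of writing to a shared file
readIds <- function(fileName, srcDir) {
  utils::read.csv(file.path(srcDir, fileName), header = TRUE)$itemID
}

cl <- makeCluster(2)    # two workers are enough for the demo
registerDoParallel(cl)

# srcDir is forwarded to readIds on each worker, like y and z in the answer
idList <- llply(files, readIds, srcDir = fileDir, .parallel = TRUE)

stopCluster(cl)

# Combine in the parent process and write the output exactly once
ids     <- unlist(idList)
outFile <- file.path(outputDir, "ids.csv")
write.table(data.frame(itemID = ids), file = outFile, sep = ",",
            row.names = FALSE, col.names = TRUE)
```

Writing from a single process also means the header row appears only once, rather than once per input file.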