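I'm fairly new to R programming, and I'm dealing with some problems related (mostly) to parallelizing big-data processing.
For this I use the data.table package for data storage and manipulation, and the snowfall package as a wrapper to parallelize the work.
Here is a specific case: I have a large vector of elements and I want to apply a function f to each element (I use a vectorized version). I balance the large vector into N parts (smaller vectors) as follows: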
library(snowfall)
sfInit(parallel = TRUE, cpus = ncpus)
balancedVector <- myVectorLoadBalanceFunction(myLargeVector, ncpus)
processedSubVectors <- sfLapply(balancedVector, function(subVector) {
  myVectorizedFunction(subVector)
})
sfStop()
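The body of myVectorLoadBalanceFunction is not shown here. Purely as an illustration, a minimal sketch of such a helper could look like the following; the name splitIntoChunks and its chunking rule (contiguous pieces of near-equal length) are assumptions, not the actual function used above.

```r
# Hypothetical stand-in for myVectorLoadBalanceFunction (not the real one):
# split x into nChunks contiguous pieces of near-equal length.
splitIntoChunks <- function(x, nChunks) {
  # Map position i to a chunk id in 1..nChunks; as.numeric() avoids
  # integer overflow for very long vectors.
  groupIds <- ceiling(as.numeric(seq_along(x)) * nChunks / length(x))
  # split() returns a list with one element per chunk id, in order.
  split(x, groupIds)
}

chunks <- splitIntoChunks(1:10, 3)
lengths(chunks)
```

With a helper like this, the balancing step would read balancedVector <- splitIntoChunks(myLargeVector, ncpus).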
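The strange thing is this: when I run the code from the command line or a script (i.e., largeVector lives in the global environment), the run time is good, and in the MS Windows Task Manager each core appears to use an amount of memory proportional to the size of its subVector. But when I run the same code inside a function (i.e., I call the function from the command line and pass largeVector as an argument), the run time gets worse, and each core now appears to hold a full copy of largeVector...
Does this make sense?
Regards
Edited to add a reproducible example
To keep it simple, here is a dummy example with a date vector of ~300 MB, containing more than 36 million elements, and the weekdays function: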
library(snowfall)
aSomewhatLargeVector <- seq.Date(from = as.Date("1900-01-01"), to = as.Date("2000-01-01"), by = 1)
aSomewhatLargeVector <- rep(aSomewhatLargeVector, 1000)
# Sequential version to compare
system.time(processedSubVectorsSequential <- weekdays(aSomewhatLargeVector))
# user system elapsed
# 108.05 1.06 109.53
gc() # I restarted R
# Parallel version within a function scope
myCallingFunction <- function(aSomewhatLargeVector) {
  sfInit(parallel = TRUE, cpus = 2)
  balancedVector <- list(aSomewhatLargeVector[seq(1, length(aSomewhatLargeVector)/2)],
                         aSomewhatLargeVector[seq(length(aSomewhatLargeVector)/2+1, length(aSomewhatLargeVector))])
  processedSubVectorsParallelFunction <- sfLapply(balancedVector, function(subVector) {
    weekdays(subVector)
  })
  sfStop()
  processedSubVectorsParallelFunction <- unlist(processedSubVectorsParallelFunction)
  return(processedSubVectorsParallelFunction)
}
system.time(processedSubVectorsParallelFunction <- myCallingFunction(aSomewhatLargeVector))
# user system elapsed
# 11.63 10.61 94.27
# user system elapsed
# 12.12 9.09 99.07
gc() # I restarted R
# Parallel version within the global scope
time0 <- proc.time()
sfInit(parallel = TRUE, cpus = 2)
balancedVector <- list(aSomewhatLargeVector[seq(1, length(aSomewhatLargeVector)/2)],
                       aSomewhatLargeVector[seq(length(aSomewhatLargeVector)/2+1, length(aSomewhatLargeVector))])
processedSubVectorsParallel <- sfLapply(balancedVector, function(subVector) {
  weekdays(subVector)
})
sfStop()
processedSubVectorsParallel <- unlist(processedSubVectorsParallel)
time1 <- proc.time()
time1-time0
# user system elapsed
# 7.94 4.75 85.14
# user system elapsed
# 9.92 3.93 89.69
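My timings are shown in the comments. The difference is not that dramatic in this dummy example, but you can still see that sequential time > parallel within a function > parallel in the global scope.
You can also see the difference in allocated memory:
3.3 GB < 5.2 GB > 4.4 GB
Hope this helps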
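One way to probe the difference between the two scopes, sketched here under assumptions: if the anonymous function passed to sfLapply is serialized before being sent to the workers (an assumption about what happens under the hood), then a function created inside another function would carry that function's environment with it, including any large vector stored there, whereas a function created at top level only carries a reference to the global environment. The names serializedSize, makeLocalClosure and bigVector below are hypothetical and only serve this check.

```r
# Hypothetical check: how many bytes does a function occupy once serialized?
serializedSize <- function(f) length(serialize(f, NULL))

bigVector <- numeric(1e6)  # roughly 8 MB of doubles

# Created at top level: the enclosing environment is the global environment,
# which serialize() records as a reference rather than by content.
globalClosure <- function(subVector) weekdays(subVector)

# Created inside a function: the enclosing environment is that call's frame,
# which holds largeVector and is serialized along with the function.
makeLocalClosure <- function(largeVector) {
  force(largeVector)  # evaluate the argument so the frame holds the data
  function(subVector) weekdays(subVector)
}
localClosure <- makeLocalClosure(bigVector)

serializedSize(globalClosure)  # small
serializedSize(localClosure)   # includes the ~8 MB vector
```

If the second size is much larger than the first, it would be consistent with each worker receiving a full copy of the large vector in the function-scope version.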