R data processing performance (snowfall package and function scope)

Asked: 2016-02-23 16:44:49

Tags: r data.table snowfall

I'm fairly new to R programming, and I'm dealing with some problems related to parallelizing (mostly) big data processing.

For this I use the data.table package for data storage and handling, and the snowfall package as a wrapper to parallelize the work.

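Here is a specific case: I have a large vector of elements and I want to apply a function f to each element (I use a vectorized version). I balance the large vector into N parts (smaller vectors) and process them like this: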

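 library(snowfall)

 sfInit(parallel = TRUE, cpus = ncpus)
 # Split the large vector into ncpus smaller vectors (user-defined helper)
 balancedVector <- myVectorLoadBalanceFunction(myLargeVector, ncpus)
 # Apply the vectorized function to each sub-vector on the workers
 processedSubVectors <- sfLapply(balancedVector, function(subVector) {
   myVectorizedFunction(subVector)
 })
 sfStop()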
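
The load-balancing helper is not shown here. As a minimal sketch of what such a helper might look like, assuming contiguous chunks of near-equal length are acceptable, the base `parallel::splitIndices` function can produce the index ranges. The name `splitIntoChunks` is hypothetical and is not the function used above.

```r
# Hypothetical helper (an assumption, not the helper used in the question):
# split a vector into nParts contiguous chunks of near-equal length.
splitIntoChunks <- function(x, nParts) {
  # splitIndices() returns a list of index vectors covering 1..length(x)
  idx <- parallel::splitIndices(length(x), nParts)
  lapply(idx, function(i) x[i])
}

chunks <- splitIntoChunks(1:10, 3)
lengths(chunks)   # number of elements in each chunk
```

Subsetting with `x[i]` keeps the class of the input, so the same sketch would also apply to a Date vector.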

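What I find strange is this: when I run the code from the command line or from a script (i.e. largeVector lives in the global environment), the run time is good, and in the MS Windows Task Manager each core appears to use an amount of memory proportional to the size of its subVector. But when I run the same code inside a function environment (i.e. I call a function from the command line and pass largeVector as an argument), the run time gets worse, and each core now appears to be using a full copy of largeVector...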
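
A possible explanation, offered as an assumption rather than a confirmed diagnosis for snowfall: R serializes a closure together with its enclosing environment, except when that environment is the global environment or a package namespace, which are stored by reference. An anonymous function created inside a calling function would then carry that function's local variables along with it. The base-R sketch below only compares serialized sizes and does not start a cluster; `makeWorker`, `big`, `heavy` and `light` are hypothetical names.

```r
# Hypothetical names throughout; this measures serialized size only.
makeWorker <- function(big) {
  force(big)            # evaluate the argument so it lives in this frame
  function(x) x + 1     # closure whose enclosing environment holds `big`
}

big <- numeric(1e6)     # about 8 MB of doubles
heavy <- makeWorker(big)            # worker defined inside a function call

light <- heavy
environment(light) <- globalenv()   # same code, global enclosing environment

heavyBytes <- length(serialize(heavy, NULL))
lightBytes <- length(serialize(light, NULL))
c(heavy = heavyBytes, light = lightBytes)
```

If this is what happens under sfLapply, passing a top-level function (for example `weekdays` itself) instead of an anonymous function defined inside the calling function might avoid shipping the whole frame to each worker.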

Does this make sense?

Regards

Edited to add a reproducible example

For simplicity, here is a dummy example using a date vector of about 300 MB, with more than 36 million elements, and the weekdays function:

library(snowfall)

aSomewhatLargeVector <- seq.Date(from = as.Date("1900-01-01"), to = as.Date("2000-01-01"), by = 1)
aSomewhatLargeVector <- rep(aSomewhatLargeVector, 1000)

# Sequential version to compare

system.time(processedSubVectorsSequential <- weekdays(aSomewhatLargeVector))
# user  system elapsed 
# 108.05    1.06  109.53 

gc() # I restarted R



# Parallel version within a function scope

myCallingFunction <- function(aSomewhatLargeVector) {
  sfInit(parallel = TRUE, cpus = 2)
  balancedVector <- list(aSomewhatLargeVector[seq(1, length(aSomewhatLargeVector)/2)],
                         aSomewhatLargeVector[seq(length(aSomewhatLargeVector)/2+1, length(aSomewhatLargeVector))])
  processedSubVectorsParallelFunction <- sfLapply(balancedVector, function(subVector) {
    weekdays(subVector)
  })
  sfStop() 
  processedSubVectorsParallelFunction <- unlist(processedSubVectorsParallelFunction)
  return(processedSubVectorsParallelFunction)
}

system.time(processedSubVectorsParallelFunction <- myCallingFunction(aSomewhatLargeVector))
# user  system elapsed 
# 11.63   10.61   94.27 
# user  system elapsed 
# 12.12    9.09   99.07 

gc() # I restarted R



# Parallel version within the global scope

time0 <- proc.time()
sfInit(parallel = TRUE, cpus = 2)
balancedVector <- list(aSomewhatLargeVector[seq(1, length(aSomewhatLargeVector)/2)],
                       aSomewhatLargeVector[seq(length(aSomewhatLargeVector)/2+1, length(aSomewhatLargeVector))])
processedSubVectorsParallel <- sfLapply(balancedVector, function(subVector) {
  weekdays(subVector)
})
sfStop() 
processedSubVectorsParallel <- unlist(processedSubVectorsParallel)
time1 <- proc.time()
time1-time0
# user  system elapsed 
# 7.94    4.75   85.14 
# user  system elapsed 
# 9.92    3.93   89.69 

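My timings are shown in the comments. Although the difference is not that dramatic in this dummy example, you can see that sequential time > parallel within a function > parallel in the global scope.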

In addition, you can see the difference in allocated memory:

Memory comparison

3.3 GB < 5.2 GB > 4.4 GB

Hope this helps.

0 answers:

No answers