R: running out of memory using `strsplit`

Time: 2013-07-15 17:29:30

Tags: r memory-management

I am running out of memory using strsplit (apparently); here is the code:

split.fields <- function (frame, fields, split, suffix, ...) {
  for (field in fields) {
    v <- sapply(strsplit(frame[[field]], split, ...), "[", 1)
    frame[[paste0(field,suffix)]] <- frame[[field]]
    frame[[field]] <- v
  }
  frame
}
split.version <- function (frame, fields)
  split.fields(frame, fields, split="@", suffix="Ver", fixed=TRUE)
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 238165 12.8     467875   25   407500 21.8
Vcells 369492  2.9     905753    7   905631  7.0
> frame <- data.frame(browser = sample(c("Chrome@28","Chrome@27","Firefox@21","Firefox@22","IE@9","IE@8"), 1e7, replace=TRUE), stringsAsFactors=FALSE)
> str(frame)
'data.frame':   10000000 obs. of  1 variable:
 $ browser: chr  "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> object.size(frame)
80000992 bytes
> gc()
           used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   240555 12.9     467875  25.0   407500  21.8
Vcells 10373979 79.2   34109873 260.3 40534688 309.3
> system.time(frame <- split.version(frame,"browser"))
   user  system elapsed 
 73.700   0.872  74.831 
> object.size(frame)
160001248 bytes
> str(frame)
'data.frame':   10000000 obs. of  2 variables:
 $ browser   : chr  "IE" "Chrome" "Chrome" "Chrome" ...
 $ browserVer: chr  "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> gc()
           used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells   264888  14.2   16652260 889.4  31376740 1675.7
Vcells 20459856 156.1   95461025 728.4 119226749  909.7

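This looks more or less reasonable, except that the RSS of the R process is now 1.6G.

This appears to imply that the 1675.7 Mb in the Ncells row of the max used column has not been returned to the OS.

I don't care much that the OS is not getting the RAM back; what I care about is why R allocates 1.6G to process 80M of data (and runs out of available physical RAM on my real data).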

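Is there a way to make this more memory efficient?

For example, would it help to convert the character vector into a factor and operate on its levels?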

R version 3.0.1 (2013-05-16) -- "Good Sport"
Platform: x86_64-pc-linux-gnu (64-bit)
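
The factor idea raised in the question can be sketched as follows. This is only an illustration, not something from the original post: `strip.version` is a hypothetical helper name, and the sketch assumes the column has few distinct values, so that the string work happens once per level rather than once per row. Whether it actually lowers peak memory on the real data would need to be measured with `gc()`.

```r
# Hypothetical helper (not from the original post): drop everything from the
# separator onwards by editing the factor levels instead of every element.
strip.version <- function (x, split = "@") {
  f <- factor(x)                      # one integer code per element
  lv <- levels(f)                     # only the distinct strings
  pos <- regexpr(split, lv, fixed = TRUE)
  # Keep the level unchanged when the separator is absent (regexpr gives -1).
  # Assigning duplicated levels merges them into one level.
  levels(f) <- ifelse(pos > 0, substr(lv, 1, pos - 1), lv)
  as.character(f)
}

x <- c("Chrome@28", "Chrome@27", "Firefox@21", "IE@8", "Opera")
res <- strip.version(x)
print(res)
```

If a factor column is acceptable downstream, the final `as.character` call could be dropped and the factor stored directly.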

2 answers:

Answer 0: (score: 4)

How about using substr and regexpr:

x <- c("Chrome@28","Chrome@27","Firefox@21","IE@8")
substr(x,1,regexpr("@",x)-1)
[1] "Chrome"  "Chrome"  "Firefox" "IE" 
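
Applied to the data frame from the question, the same idea might look like the sketch below. `split.fields2` is a hypothetical name chosen here so that it does not clash with the original function; it assumes a literal (non-regex) separator and leaves values without the separator unchanged. It avoids building the intermediate list that `strsplit` returns, which is one plausible source of the large Ncells figure, though this has not been measured here.

```r
# Sketch: same interface as split.fields in the question, but without strsplit.
split.fields2 <- function (frame, fields, split, suffix) {
  for (field in fields) {
    x <- frame[[field]]
    pos <- regexpr(split, x, fixed = TRUE)    # position of the separator, -1 if absent
    frame[[paste0(field, suffix)]] <- x       # keep the original value, as in the question
    frame[[field]] <- ifelse(pos > 0, substr(x, 1, pos - 1), x)
  }
  frame
}

frame <- data.frame(browser = c("IE@8", "Chrome@27", "Firefox@21"),
                    stringsAsFactors = FALSE)
frame <- split.fields2(frame, "browser", split = "@", suffix = "Ver")
str(frame)
```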

Answer 1: (score: 3)

What @James said, or even simpler:

x <- c("Chrome@28","Chrome@27","Firefox@21","IE@8")
sub('@.*', '', x)
#[1] "Chrome"  "Chrome"  "Firefox" "IE"
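
One difference between the approaches is worth checking before switching: how each one treats a value that contains no "@". The small comparison below uses an invented value, "Opera", for that case. With `strsplit` and `sub` the value comes back unchanged, while the plain `substr`/`regexpr` form returns an empty string, because `regexpr` reports -1 when there is no match.

```r
x <- c("Chrome@28", "IE@8", "Opera")   # "Opera" is an invented value with no "@"

via.strsplit <- sapply(strsplit(x, "@", fixed = TRUE), "[", 1)
via.substr   <- substr(x, 1, regexpr("@", x) - 1)
via.sub      <- sub("@.*", "", x)

print(via.strsplit)
print(via.substr)
print(via.sub)
```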