I am running out of memory (presumably) when using strsplit; here is the code:
split.fields <- function (frame, fields, split, suffix, ...) {
  for (field in fields) {
    # keep only the part before the first separator
    v <- sapply(strsplit(frame[[field]], split, ...), "[", 1)
    # preserve the original value in a new column named <field><suffix>
    frame[[paste0(field, suffix)]] <- frame[[field]]
    frame[[field]] <- v
  }
  frame
}
split.version <- function (frame, fields)
  split.fields(frame, fields, split="@", suffix="Ver", fixed=TRUE)
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 238165 12.8     467875   25   407500 21.8
Vcells 369492  2.9     905753    7   905631  7.0
> frame <- data.frame(browser = sample(c("Chrome@28","Chrome@27","Firefox@21","Firefox@22","IE@9","IE@8"), 1e7, replace=TRUE), stringsAsFactors=FALSE)
> str(frame)
'data.frame': 10000000 obs. of 1 variable:
$ browser: chr "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> object.size(frame)
80000992 bytes
> gc()
           used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   240555 12.9     467875  25.0   407500  21.8
Vcells 10373979 79.2   34109873 260.3 40534688 309.3
> system.time(frame <- split.version(frame,"browser"))
   user  system elapsed
 73.700   0.872  74.831
> object.size(frame)
160001248 bytes
> str(frame)
'data.frame': 10000000 obs. of 2 variables:
$ browser : chr "IE" "Chrome" "Chrome" "Chrome" ...
$ browserVer: chr "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> gc()
           used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells   264888  14.2   16652260 889.4  31376740 1675.7
Vcells 20459856 156.1   95461025 728.4 119226749  909.7
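All of this looks more or less reasonable, except that the RSS of the R process is now 1.6G.
This seems to mean that the 1675.7Mb shown for Ncells in the max used column has not been returned to the OS.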
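I do not care much that the OS is not getting the RAM back. What I do care about is that R allocates 1.6G to process 80M of data (and runs out of available physical RAM on my real data).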
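Is there a way to make this more memory efficient?
For example, would it help to convert the character vector to a factor and operate on its levels?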
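One way the factor idea could look, as a rough sketch rather than a measured fix: the string work is done once per distinct value and the result is mapped back through the integer codes. The helper name strip.by.levels is hypothetical, and the sketch assumes every value contains the separator.

```r
# Hypothetical helper, not part of the original code.
# Assumes every element of x contains the separator `split`.
strip.by.levels <- function(x, split = "@") {
  f <- factor(x)                      # integer codes plus the distinct values
  lev <- levels(f)
  # cut each distinct value just before the first separator
  short <- substr(lev, 1, regexpr(split, lev, fixed = TRUE) - 1)
  short[as.integer(f)]                # map the shortened levels back to every row
}

x <- c("IE@8", "Chrome@27", "Chrome@28", "Firefox@21", "IE@8")
res <- strip.by.levels(x)
print(res)
```

Whether this lowers peak memory on 1e7 rows would have to be checked with gc(), since factor() has a cost of its own.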
R version 3.0.1 (2013-05-16) -- "Good Sport"
Platform: x86_64-pc-linux-gnu (64-bit)
Answer 0 (score: 4)
How about using substr and regexpr:
x <- c("Chrome@28","Chrome@27","Firefox@21","IE@8")
substr(x,1,regexpr("@",x)-1)
[1] "Chrome" "Chrome" "Firefox" "IE"
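One caveat worth checking, shown here as a small sketch: regexpr returns -1 when the separator is absent, so substr then yields an empty string for that element. The variable names below are illustrative, and fixed = TRUE is added on the assumption that the separator should be matched literally, as in the question.

```r
x <- c("Chrome@28", "IE@8", "Opera")          # "Opera" has no separator
pos <- regexpr("@", x, fixed = TRUE)          # -1 where "@" is not found

# Direct use: elements without a separator become empty strings
naive <- substr(x, 1, pos - 1)

# Guarded use: only shorten the elements that actually contain "@"
safe <- x
has.sep <- pos > 0
safe[has.sep] <- substr(x[has.sep], 1, pos[has.sep] - 1)

print(naive)
print(safe)
```

If every value is known to contain the separator, as in the sample data of the question, the guard is unnecessary.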
Answer 1 (score: 3)
What @James said, or even simpler:
x <- c("Chrome@28","Chrome@27","Firefox@21","IE@8")
sub('@.*', '', x)
#[1] "Chrome" "Chrome" "Firefox" "IE"
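For illustration, this is how the sub approach might be dropped into the function from the question. It is a sketch under assumptions: the name split.fields.sub is made up here, and the pattern is built by pasting, which is only safe when split contains no regex metacharacters (true for "@").

```r
# Hypothetical variant of split.fields that avoids strsplit.
# Assumes `split` contains no regular-expression metacharacters.
split.fields.sub <- function(frame, fields, split, suffix) {
  pattern <- paste0(split, ".*")      # separator and everything after it
  for (field in fields) {
    # keep the original value in <field><suffix>, as in the question
    frame[[paste0(field, suffix)]] <- frame[[field]]
    # sub() works on the whole vector and returns a plain character vector
    frame[[field]] <- sub(pattern, "", frame[[field]])
  }
  frame
}

frame <- data.frame(browser = c("IE@8", "Chrome@27", "Firefox@21"),
                    stringsAsFactors = FALSE)
out <- split.fields.sub(frame, "browser", split = "@", suffix = "Ver")
print(out)
```

Unlike strsplit, sub does not return a list with one element per row, which is presumably where much of the extra allocation in the question comes from.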