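I am making various ggplots on a very large dataset (much larger than the one in the example). To make such a large dataset plottable, I wrote a binning function that bins both the x-axis and the y-axis.

In the example below, memory.size() is recorded at the start. A large dataset is then simulated as dt. x2 of dt is plotted against x1 with binning. The plotting is repeated with different subsets of dt. The size of the plot objects is checked with object.size() and stored. After the plot objects are created, rm(dt) is executed, followed by gc() twice. At this point, memory.size() is recorded again. Finally, the memory.size() at the end is compared with the one at the start.

Given the small size of the plot objects, I expected memory.size() at the end to be close to the value at the start. But it is not. memory.size() does not go down until I restart a new R session.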
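
A side note on the measurement itself: memory.size() is only available on Windows and, as far as I know, is no longer supported from R 4.2.0 onward. Here is a minimal sketch of a portable way to take the same before/after readings, assuming the "(Mb)" used column returned by gc() is an acceptable proxy. It reports R's own heap rather than what the operating system has handed to the process, so the two numbers can disagree. mem_used_mb is a hypothetical helper, not part of my actual code.

```r
# Hypothetical helper: MB currently used by R's heap, as reported by gc().
mem_used_mb <- function() {
  g <- gc()        # runs a collection and returns a usage matrix
  sum(g[, 2])      # column 2 is the "(Mb)" used figure for Ncells and Vcells
}

before <- mem_used_mb()
x <- runif(5e6)    # roughly 40 MB of doubles
during <- mem_used_mb()
rm(x)
after <- mem_used_mb()

print(c(before = before, during = during, after = after))
```

If the heap reading returns to its starting level while the process-level figure does not, the memory may have been freed by R but not handed back to the operating system, which is a different situation from objects still being referenced.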
Reproducible example
library(data.table)
library(ggplot2)
library(magrittr)
# The binning function
# dt = data.table to bin
# x = column name for x-axis (character)
# y = column name for y-axis (character)
# xNItv = Number of bins for x-axis
# yNItv = Number of bins for y-axis
# Value: A binned data.table
tab_by_bin_idxy <- function(dt, x, y, xNItv, yNItv) {
  # Binning
  xBreaks = dt[, seq(min(get(x), na.rm = TRUE), max(get(x), na.rm = TRUE), length.out = xNItv + 1)]
  yBreaks = dt[, seq(min(get(y), na.rm = TRUE), max(get(y), na.rm = TRUE), length.out = yNItv + 1)]
  xbinCode = dt[, .bincode(get(x), breaks = xBreaks, include.lowest = TRUE)]
  xbinMid = sapply(seq(xNItv), function(i) {return(mean(xBreaks[c(i, i+1)]))})[xbinCode]
  ybinCode = dt[, .bincode(get(y), breaks = yBreaks, include.lowest = TRUE)]
  ybinMid = sapply(seq(yNItv), function(i) {return(mean(yBreaks[c(i, i+1)]))})[ybinCode]
  # Creating table: count the rows in every (x bin, y bin) cell
  tab_match = CJ(xbinCode = seq(xNItv), ybinCode = seq(yNItv))
  tab_plot = data.table(xbinCode, xbinMid, ybinCode, ybinMid)[
    tab_match, .(xbinMid = xbinMid[1], ybinMid = ybinMid[1], N = .N), keyby = .EACHI, on = c("xbinCode", "ybinCode")
  ]
  # Returning table
  return(tab_plot)
}
before.mem.size <- memory.size()
# Simulation of dataset
nrow <- 6e5
ncol <- 60
dt <- do.call(data.table, lapply(seq(ncol), function(i) {return(runif(nrow))}) %>% set_names(paste0("x", seq(ncol))))
# Graph plotting
dummyEnv <- new.env()
with(dummyEnv, {
  fcn <- function(tab) {
    binned.dt <- tab_by_bin_idxy(dt = tab, x = "x1", y = "x2", xNItv = 50, yNItv = 50)
    plot <- ggplot(binned.dt, aes(x = xbinMid, y = ybinMid)) + geom_point(aes(size = N))
    return(plot)
  }
  lst_plots <- list(
    plot1 = fcn(dt),
    plot2 = fcn(dt[x1 <= 0.7]),
    plot3 = fcn(dt[x5 <= 0.3])
  )
  assign("size.of.plots", object.size(lst_plots), envir = .GlobalEnv)
})
rm(dummyEnv)
# After use, remove and clean up of dataset
rm(dt)
gc();gc()
after.mem.size <- memory.size()
# Memory reports
print(paste0("before.mem.size = ", before.mem.size))
print(paste0("after.mem.size = ", after.mem.size))
print(paste0("plot.objs.size = ", size.of.plots / 1000000))
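I have tried the following modifications of the code:
- Inside fcn, removing ggplot and returning NULL instead of a plot object: the memory leak is completely gone. But this is not a solution. I need the plots.
- The fewer plots are requested from fcn, and the fewer columns or rows the data has, the smaller the memory leak.
- After rm(list = ls()), the memory is still not recovered.

I would like to know why this happens and how to get rid of it without compromising my need to make binned plots and to subset dt for the different plots.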
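
One thing that may be worth checking: object.size() does not count the contents of environments that an object merely references, so a small reported size does not prove that little memory is retained. The base-R sketch below illustrates that effect only. make_holder and big are hypothetical names, and whether ggplot objects retain the data in this way is an assumption on my part, not something the example above demonstrates.

```r
# Base R only; 'make_holder' and 'big' are hypothetical names for illustration.
make_holder <- function(big) {
  force(big)                     # evaluate the argument so 'big' lives in this frame
  list(
    summary = length(big),       # the small payload we actually want to keep
    env = environment()          # reference to the frame that still holds 'big'
  )
}

holder <- make_holder(runif(1e6))

# object.size() does not follow the environment reference ...
reported <- as.numeric(object.size(holder))
# ... but the frame still keeps the full vector reachable.
retained <- as.numeric(object.size(holder$env$big))

print(c(reported = reported, retained = retained))
```

Here the reported size stays tiny while about 8 MB remains reachable through holder$env, so gc() cannot free it as long as holder (or anything else pointing to that environment) exists.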
Thanks for your attention!