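Having read "Memory leaks parsing XML in r" (including the linked posts) and this post on R-help, and given that some time has passed since then, I still think this is an unresolved issue worth noting, as the XML package is widely used throughout the R universe.
So please regard this as a follow-up post and/or reference that hopefully provides a concise illustration of the problem.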
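After parsing XML/HTML documents, they can be searched using XPath, which requires the internal use of C pointers (AFAIU). It seems that, at least on MS Windows (I am running Windows 8.1, 64-bit), not all of these references are properly recognized by the garbage collector. The consumed memory is therefore not released properly, and at some point the R process freezes.
It seems that XML::free and/or gc do not recognize all memory when XML/HTML documents are parsed via xmlParse or htmlParse and then processed with xpathApply or the like:
The memory usage reported for the OS task (Rterm.exe) grows very quickly, while the memory of the R process "as seen from within R" (function memory.size) grows only moderately by comparison. See the list elements mem_r, mem_os and ratio before and after the substantial parsing cycles below.
All in all, even after throwing in everything that has been recommended (free, rm and gc), memory usage still always increases when something like xmlParse is called. It is only a question of how much. So IMHO there must still be something that is not working properly.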
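For readers who want to track the same two quantities outside Windows, here is a minimal sketch. It assumes that the "used" figures from gc() are an acceptable stand-in for memory.size() (which is Windows-specific) and that a POSIX `ps` is on the PATH; the helper names r_mem_mb and os_mem_mb are made up for this illustration and are not part of the original profiling code.

```r
# R-side memory in MB: used Ncells + Vcells as reported by gc().
r_mem_mb <- function() {
  g <- gc()
  sum(g[, 2])  # column 2 is the "(Mb)" column that belongs to "used"
}

# OS-side memory in MB (resident set size) of the current process.
# Assumption: `ps` is available; NA is returned on Windows or if the call fails.
os_mem_mb <- function(pid = Sys.getpid()) {
  if (.Platform$OS.type != "unix") return(NA_real_)
  out <- tryCatch(
    suppressWarnings(system2("ps", c("-o", "rss=", "-p", pid), stdout = TRUE)),
    error = function(e) character(0)
  )
  kb <- suppressWarnings(as.numeric(trimws(out)))
  if (length(kb) != 1 || is.na(kb)) NA_real_ else kb / 1024
}

status <- list(r = r_mem_mb(), os = os_mem_mb())
status$ratio <- status$os / status$r  # same idea as the `ratio` element
```

A ratio that keeps climbing across parsing cycles would point to memory held outside R's own heap, which is the pattern described above.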
I borrowed the profiling code from Duncan's Omegahat git repository.
Some preparations:
Sys.setenv("LANGUAGE"="en")
require("compiler")
require("XML")
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] compiler stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] XML_3.98-1.1
The functions that we need:
## Memory (in MB) of an OS task as reported by the Windows tool `tasklist`
getTaskMemoryByPid <- cmpfun(function(
  pid=Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  ## Drop dots (thousands separator in this German locale), whitespace and the "K" unit
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}, options=list(suppressAll=TRUE))
## Parse an XML document n times and compare memory before and after
memoryLeak <- cmpfun(function(
  x=system.file("exampleData", "mtcars.xml", package="XML"),
  n=10000,
  use_text=FALSE,
  xpath=FALSE,
  free_doc=FALSE,
  clean_up=FALSE,
  detailed=FALSE
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_1 <- memory.profile()
  mem_before <- list(mem_r=mem_r,
    mem_os=mem_os, ratio=mem_os/mem_r)
  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- xmlParse(x, asText=use_text)
    if (xpath) {
      res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile=memory.profile(),
        size=memory.size()
      )
    }
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  }
  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
  }
  ## After //
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_2 <- memory.profile()
  mem_after <- list(mem_r=mem_r,
    mem_os=mem_os, ratio=mem_os/mem_r)
  list(
    before=mem_before,
    perrun=mem_perrun,
    gc=mem_gc,
    after=mem_after,
    comparison_r=data.frame(
      before=prof_1,
      after=prof_2,
      increase=round((prof_2/prof_1)-1, 4)
    ),
    increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
    increase_os=(mem_after$mem_os/mem_before$mem_os)-1
  )
}, options=list(suppressAll=TRUE))
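The gsub pattern in getTaskMemoryByPid strips "." as the thousands separator, which fits the German locale shown in sessionInfo(). As a hedged illustration of a locale-independent variant, the sketch below removes every non-digit instead. The sample lines are made up to resemble English-locale `tasklist /FO csv` output, and parse_tasklist_mem is a hypothetical helper name.

```r
# Made-up sample of `tasklist /FO csv` output (English number formatting).
sample_csv <- c(
  '"Image Name","PID","Session Name","Session#","Mem Usage"',
  '"Rterm.exe","1234","Console","1","83,608 K"'
)

# Take the fifth column ("Mem Usage"), keep digits only, and divide by 1000
# as the original function does to go from KB to (roughly) MB.
parse_tasklist_mem <- function(lines) {
  tab <- read.csv(text = lines, stringsAsFactors = FALSE)
  kb <- as.numeric(gsub("[^0-9]", "", tab[, 5]))
  kb / 1000
}

mem_mb <- parse_tasklist_mem(sample_csv)
```

Dropping all non-digits works for both "83.608 K" and "83,608 K", at the cost of assuming the column never contains a decimal fraction.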
Quick facts: garbage collection enabled, XML document parsed n times but not searched via xpathApply.
Note the ratio of OS memory to R memory:
Before: 1.364832
After: 1.322702
res <- memoryLeak(clean_up=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))
> res
$before
$before$mem_r
[1] 37.42
$before$mem_os
[1] 51.072
$before$ratio
[1] 1.364832
$perrun
NULL
$gc
$gc$gc_mb
[1] 45
$after
$after$mem_r
[1] 63.21
$after$mem_os
[1] 83.608
$after$ratio
[1] 1.322702
$comparison_r
before after increase
NULL 1 1 0.0000
symbol 7387 7392 0.0007
pairlist 190383 390633 1.0518
closure 5077 55085 9.8499
environment 1032 51032 48.4496
promise 5226 105226 19.1351
language 54675 54791 0.0021
special 44 44 0.0000
builtin 648 648 0.0000
char 8746 8763 0.0019
logical 9081 9084 0.0003
integer 22804 22807 0.0001
double 2773 2783 0.0036
complex 1 1 0.0000
character 44522 94569 1.1241
... 0 0 NaN
any 0 0 NaN
list 19946 19951 0.0003
expression 1 1 0.0000
bytecode 16049 16050 0.0001
externalptr 1487 1487 0.0000
weakref 391 391 0.0000
raw 392 392 0.0000
S4 1392 1392 0.0000
$increase_r
[1] 0.6892036
$increase_os
[1] 0.6370614
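Quick facts: garbage collection enabled, free called explicitly, XML document parsed n times but not searched via xpathApply.
Note the ratio of OS memory to R memory:
Before: 1.315249
After: 1.222143
res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)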
save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))
> res
$before
$before$mem_r
[1] 63.48
$before$mem_os
[1] 83.492
$before$ratio
[1] 1.315249
$perrun
NULL
$gc
$gc$gc_mb
[1] 69.3
$after
$after$mem_r
[1] 95.92
$after$mem_os
[1] 117.228
$after$ratio
[1] 1.222143
$comparison_r
before after increase
NULL 1 1 0.0000
symbol 7454 7454 0.0000
pairlist 392455 592466 0.5096
closure 55104 105104 0.9074
environment 51032 101032 0.9798
promise 105226 205226 0.9503
language 55592 55592 0.0000
special 44 44 0.0000
builtin 648 648 0.0000
char 8847 8848 0.0001
logical 9141 9141 0.0000
integer 23109 23111 0.0001
double 2802 2807 0.0018
complex 1 1 0.0000
character 94775 144781 0.5276
... 0 0 NaN
any 0 0 NaN
list 20174 20177 0.0001
expression 1 1 0.0000
bytecode 16265 16265 0.0000
externalptr 1488 1487 -0.0007
weakref 392 391 -0.0026
raw 393 392 -0.0025
S4 1392 1392 0.0000
$increase_r
[1] 0.5110271
$increase_os
[1] 0.4040627
Quick facts: garbage collection enabled, free called explicitly, XML document parsed n times and searched via xpathApply each time.
Note the ratio of OS memory to R memory:
Before: 1.220429
After: 13.15629 (!)
res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))
res
$before
$before$mem_r
[1] 95.94
$before$mem_os
[1] 117.088
$before$ratio
[1] 1.220429
$perrun
NULL
$gc
$gc$gc_mb
[1] 93.4
$after
$after$mem_r
[1] 124.64
$after$mem_os
[1] 1639.8
$after$ratio
[1] 13.15629
$comparison_r
before after increase
NULL 1 1 0.0000
symbol 7454 7460 0.0008
pairlist 592458 793042 0.3386
closure 105104 155110 0.4758
environment 101032 151032 0.4949
promise 205226 305226 0.4873
language 55592 55882 0.0052
special 44 44 0.0000
builtin 648 648 0.0000
char 8847 8867 0.0023
logical 9142 9162 0.0022
integer 23109 23112 0.0001
double 2802 2832 0.0107
complex 1 1 0.0000
character 144775 194819 0.3457
... 0 0 NaN
any 0 0 NaN
list 20174 20177 0.0001
expression 1 1 0.0000
bytecode 16265 16265 0.0000
externalptr 1488 1487 -0.0007
weakref 392 391 -0.0026
raw 393 392 -0.0025
S4 1392 1392 0.0000
$increase_r
[1] 0.2991453
$increase_os
[1] 13.00485
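To make the arithmetic behind increase_r, increase_os and ratio explicit, here is a small worked sketch using the values printed in the output of this run. The helper name rel_increase is made up for illustration; the formula is the same `after/before - 1` used in memoryLeak.

```r
# Values copied from the $before and $after elements of this run.
before <- list(r = 95.94,  os = 117.088)
after  <- list(r = 124.64, os = 1639.8)

# Relative increase, as computed inside memoryLeak.
rel_increase <- function(before, after) after / before - 1

increase_r  <- rel_increase(before$r,  after$r)   # R-side growth
increase_os <- rel_increase(before$os, after$os)  # OS-side growth
ratio_after <- after$os / after$r                 # OS memory per unit of R memory

round(c(increase_r = increase_r, increase_os = increase_os,
        ratio_after = ratio_after), 4)
```

The R-side figure grows by roughly 30 percent while the OS-side figure grows about thirteenfold, which is why the ratio jumps from about 1.22 to about 13.16.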
I also tried different versions. Well, I tried to try ;-)
FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source works fine).
> install.packages("XML", repos="http://www.omegahat.org/R", type="source")
trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
opened URL
downloaded 1.5 Mb
* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
The downloaded source packages are in
'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
Warning messages:
1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1
2: In install.packages("XML", repos = "http://www.omegahat.org/R", :
installation of package 'XML' had non-zero exit status
I did not follow the recommendations in the README at the github repo, as it refers to a directory that contains only a tar.gz of version 3.94-0 (we are at 3.98-1.1 on CRAN).
Even though it is stated that the github repo does not follow the standard R package structure, I tried installing from it anyway, and failed ;-)
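Answer 0 (score: 4)
Although it is still in its infancy (only a couple of months old!) and has a few quirks, Hadley Wickham has written a library for XML parsing, xml2, which can be found on Github. It is limited to reading rather than writing XML, but I have been experimenting with it for parsing XML, and it looks like it will do the job without the memory leaks of the XML package! The functions it provides include:
read_xml() reads an XML file
xml_children() gets the child nodes of a node
xml_text() gets the text within a tag
xml_attrs() gets a character vector of a node's attributes and values, which can be converted to a named list with as.list()
Note that you still need to make sure that you rm() XML node objects once you are done with them and force a garbage collection with gc(), but the memory is then actually released to the O/S (disclaimer: only tested on Windows 7, but this seems to be the most "memory-leaky" platform anyway).
Hope this helps someone!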
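The functions listed in this answer can be combined roughly as follows. This is a minimal sketch, assuming the xml2 package is installed (the block is skipped otherwise); the XML snippet and the variable names are made up for illustration.

```r
# Assumes xml2 is installed; the document below is a made-up example.
if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- xml2::read_xml(
    '<cars><car name="Mazda RX4">21.0</car><car name="Datsun 710">22.8</car></cars>'
  )
  nodes <- xml2::xml_children(doc)

  values <- xml2::xml_text(nodes)                    # text inside each <car> tag
  attrs  <- lapply(xml2::xml_attrs(nodes), as.list)  # one named list per node

  # Drop the R references to the nodes and the document, then collect garbage.
  rm(nodes, doc)
  invisible(gc())
}
```

Only plain R vectors and lists (values, attrs) survive the rm() call, so nothing in the session still points at the parsed document.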
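Answer 1 (score: 1)
Building on Matthew Wise's answer above about using xml2, I found that the function that really frees the memory is xml_remove(), followed by gc(), rather than rm().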
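A hedged sketch of the sequence described in this answer, again assuming xml2 is installed; the tiny document and the variable names are invented for the example, and whether this releases memory on a given platform is something to measure rather than assume.

```r
# Assumes xml2 is installed; the document is a made-up example.
if (requireNamespace("xml2", quietly = TRUE)) {
  doc   <- xml2::read_xml("<root><a>1</a><b>2</b></root>")
  nodes <- xml2::xml_children(doc)
  texts <- xml2::xml_text(nodes)   # extract what is needed first

  xml2::xml_remove(nodes)          # detach the nodes from the document
  n_left <- length(xml2::xml_children(doc))

  rm(nodes, doc)
  invisible(gc())                  # then trigger a garbage collection
}
```

Extracting the text before calling xml_remove() matters here, because the nodes are no longer part of the document afterwards.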
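Answer 2 (score: 0)
Nothing has happened since I posted the question, so I thought I would draw attention to it again.
Here is a slightly updated version of my investigation: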
require("rvest")
require("XML")
## Memory (in MB) of an OS task as reported by the Windows tool `tasklist`
getTaskMemoryByPid <- function(
  pid = Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}
## Snapshot of R memory, OS memory and their ratio
getCurrentMemoryStatus <- function() {
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  list(r = mem_r, os = mem_os, ratio = mem_os/mem_r)
}
## Parse a document n times (via XML or rvest) and compare memory before and after
memoryLeak <- function(
  x = system.file("exampleData", "mtcars.xml", package="XML"),
  n = 10000,
  use_text = FALSE,
  xpath = FALSE,
  free_doc = FALSE,
  clean_up = FALSE,
  detailed = FALSE,
  use_rvest = FALSE,
  user_agent = httr::user_agent("Mozilla/5.0")
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  prof_1 <- memory.profile()
  mem_before <- getCurrentMemoryStatus()
  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- if (!use_rvest) {
      xmlParse(x, asText = use_text)
    } else {
      if (file.exists(x)) {
        ## From disk //
        rvest::html(x)
      } else {
        ## From web //
        rvest::html_session(x, user_agent)
      }
    }
    if (xpath) {
      res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile = memory.profile(),
        size = memory.size()
      )
    }
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  }
  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
  }
  ## After //
  prof_2 <- memory.profile()
  mem_after <- getCurrentMemoryStatus()
  ## Return value //
  if (detailed) {
    list(
      before = mem_before,
      perrun = mem_perrun,
      gc = mem_gc,
      after = mem_after,
      comparison_r = data.frame(
        before = prof_1,
        after = prof_2,
        increase = round((prof_2/prof_1)-1, 4)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  } else {
    list(
      before_after = data.frame(
        r = c(mem_before$r, mem_after$r),
        os = c(mem_before$os, mem_after$os)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  }
}
getCurrentMemoryStatus()
s <- html_session("http://had.co.nz/")
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "hadley.html")
# html("hadley.html")
s <- html_session(
"http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
httr::user_agent("Mozilla/5.0"))
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "amazon.html")
# html("amazon.html")
getCurrentMemoryStatus()
################
## Mtcars.xml ##
################
res <- memoryLeak(n = 50000, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.1.rdata")
save(res, file = fpath)
res <- memoryLeak(n = 50000, clean_up = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.2.rdata")
save(res, file = fpath)
res <- memoryLeak(n = 50000, clean_up = TRUE, free_doc = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.3.rdata")
save(res, file = fpath)
###################
## www.had.co.nz ##
###################
## Offline //
res <- memoryLeak(x = "hadley.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.1.rdata")
save(res, file = fpath)
res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE,
detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.2.rdata")
save(res, file = fpath)
res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE,
free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.3.rdata")
save(res, file = fpath)
## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://had.co.nz/"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.1.rdata")
save(res, file = fpath)
res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.2.rdata")
save(res, file = fpath)
res <- memoryLeak(x = .url, n = 50, clean_up = TRUE,
free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.3.rdata")
save(res, file = fpath)
####################
## www.amazon.com ##
####################
## Offline //
res <- memoryLeak(x = "amazon.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)
res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE,
detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)
res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE,
free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)
## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd"
## File names use 5.x here so the offline results (4.x) are not overwritten
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.1.rdata")
save(res, file = fpath)
res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.2.rdata")
save(res, file = fpath)
res <- memoryLeak(x = .url, n = 50, clean_up = TRUE,
free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.3.rdata")
save(res, file = fpath)