Memory leaks when using package XML on Windows

Time: 2014-05-16 13:30:58

Tags: xml windows r parsing memory-leaks

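Having read "Memory leaks parsing XML in r" (including the linked posts) and this post on R-help, and given that some time has passed since then, I still think this is an unresolved issue that deserves attention, as the XML package is widely used throughout the R universe.

So please regard this as a follow-up post and/or reference, hopefully with a clear and concise illustration of the problem.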

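The problem

Parsed XML/HTML documents can later be searched with XPath, which requires the internal use of C pointers (AFAIU). At least on MS Windows (I am running Windows 8.1, 64-bit), these references do not seem to be handled correctly by the garbage collector. As a result, the consumed memory is not properly freed, and at some point the R process freezes.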
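
To illustrate why such memory can escape R's accounting: an R object that wraps a C pointer is tiny from R's point of view, and the memory behind it is only released if a finalizer (or an explicit call such as free()) does so. The base-R sketch below mimics this with an environment standing in for the external pointer. The names handle and released are made up for the illustration, and this is not a description of how the XML package is implemented.

```r
# `released` records whether the (simulated) C-level cleanup has run.
released <- new.env()
released$done <- FALSE

# Stand-in for an object that owns memory outside of R's heap.
handle <- new.env()

# The finalizer is the hook R offers for releasing foreign memory;
# a real package would free its C structures here.
reg.finalizer(handle, function(e) released$done <- TRUE)

rm(handle)         # drop the last reference ...
invisible(gc())    # ... and let the collector run the finalizer

released$done      # TRUE once the finalizer has run
```

If a package's finalizer is never triggered, or does not free everything it allocated, gc() has nothing more it can do, which would be consistent with the symptoms described here.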

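Central findings so far

When XML/HTML documents are parsed via xmlParse or htmlParse and then processed with xpathApply and the like, it seems that free and/or gc do not recognize all of the memory involved:

The memory usage reported for the OS task (Rterm.exe) grows significantly faster than the memory of the R process "as seen from within R" (function memory.size), which increases only moderately by comparison. See the list elements mem_r, mem_os and ratio before and after the substantial parsing cycles below.

All in all, even after throwing in everything that has been recommended (free, rm and gc), memory usage still always increases when xmlParse and the like are called. It is just a matter of how much. So IMHO there must still be something that is not working correctly.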
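
For reference, the summary figures used throughout (ratio, increase_r and increase_os) are simple quotients of the two memory readings. Here is a small sketch using the readings reported for Scenario 3 further down; the helper name summarizeMemory is only for illustration.

```r
# Memory readings in MB, copied from the Scenario 3 output in this post.
before <- list(mem_r = 95.94,  mem_os = 117.088)
after  <- list(mem_r = 124.64, mem_os = 1639.8)

summarizeMemory <- function(before, after) {
  list(
    ratio_before = before$mem_os / before$mem_r,     # OS memory per unit of R memory
    ratio_after  = after$mem_os / after$mem_r,
    increase_r   = after$mem_r / before$mem_r - 1,   # relative growth as seen from R
    increase_os  = after$mem_os / before$mem_os - 1  # relative growth as seen by the OS
  )
}

str(summarizeMemory(before, after))
```

The ratio climbs from roughly 1.22 to roughly 13.16 while R itself reports only about 30% growth, which is exactly the discrepancy this post is about.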


Illustration

I borrowed the profiling code from Duncan's Omegahat git repository.

Some preparations:

Sys.setenv("LANGUAGE"="en")
require("compiler")
require("XML")

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] XML_3.98-1.1

The functions we need:

getTaskMemoryByPid <- cmpfun(function(
  pid=Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}, options=list(suppressAll=TRUE))

memoryLeak <- cmpfun(function(
  x=system.file("exampleData", "mtcars.xml", package="XML"),
  n=10000,
  use_text=FALSE,
  xpath=FALSE,
  free_doc=FALSE,
  clean_up=FALSE,
  detailed=FALSE
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  mem_os  <- getTaskMemoryByPid()
  mem_r   <- memory.size()
  prof_1  <- memory.profile()
  mem_before <- list(mem_r=mem_r, mem_os=mem_os, ratio=mem_os/mem_r)

  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- xmlParse(x, asText=use_text)
    if (xpath) {
      res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile=memory.profile(),
        size=memory.size()
      )
    }
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  }

  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
  }

  ## After //
  mem_os  <- getTaskMemoryByPid()
  mem_r   <- memory.size()
  prof_2  <- memory.profile()
  mem_after <- list(mem_r=mem_r, mem_os=mem_os, ratio=mem_os/mem_r)
  list(
    before=mem_before,
    perrun=mem_perrun,
    gc=mem_gc,
    after=mem_after,
    comparison_r=data.frame(
      before=prof_1,
      after=prof_2,
      increase=round((prof_2/prof_1)-1, 4)
    ),
    increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
    increase_os=(mem_after$mem_os/mem_before$mem_os)-1
  )
}, options=list(suppressAll=TRUE))

Results

Scenario 1

Quick facts: garbage collection enabled, XML document parsed n times, but not searched via xpathApply.

Note the ratio of OS memory to R memory:

Before: 1.364832

After: 1.322702

res <- memoryLeak(clean_up=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))
> res
$before
$before$mem_r
[1] 37.42

$before$mem_os
[1] 51.072

$before$ratio
[1] 1.364832


$perrun
NULL

$gc
$gc$gc_mb
[1] 45


$after
$after$mem_r
[1] 63.21

$after$mem_os
[1] 83.608

$after$ratio
[1] 1.322702


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7387   7392   0.0007
pairlist    190383 390633   1.0518
closure       5077  55085   9.8499
environment   1032  51032  48.4496
promise       5226 105226  19.1351
language     54675  54791   0.0021
special         44     44   0.0000
builtin        648    648   0.0000
char          8746   8763   0.0019
logical       9081   9084   0.0003
integer      22804  22807   0.0001
double        2773   2783   0.0036
complex          1      1   0.0000
character    44522  94569   1.1241
...              0      0      NaN
any              0      0      NaN
list         19946  19951   0.0003
expression       1      1   0.0000
bytecode     16049  16050   0.0001
externalptr   1487   1487   0.0000
weakref        391    391   0.0000
raw            392    392   0.0000
S4            1392   1392   0.0000

$increase_r
[1] 0.6892036

$increase_os
[1] 0.6370614

Scenario 2

Quick facts: garbage collection enabled, free explicitly called, XML document parsed n times, but not searched via xpathApply.

Note the ratio of OS memory to R memory:

Before: 1.315249

After: 1.222143

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))
> res
$before
$before$mem_r
[1] 63.48

$before$mem_os
[1] 83.492

$before$ratio
[1] 1.315249


$perrun
NULL

$gc
$gc$gc_mb
[1] 69.3


$after
$after$mem_r
[1] 95.92

$after$mem_os
[1] 117.228

$after$ratio
[1] 1.222143


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7454   0.0000
pairlist    392455 592466   0.5096
closure      55104 105104   0.9074
environment  51032 101032   0.9798
promise     105226 205226   0.9503
language     55592  55592   0.0000
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8848   0.0001
logical       9141   9141   0.0000
integer      23109  23111   0.0001
double        2802   2807   0.0018
complex          1      1   0.0000
character    94775 144781   0.5276
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.5110271

$increase_os
[1] 0.4040627

Scenario 3

Quick facts: garbage collection enabled, free explicitly called, XML document parsed n times and searched via xpathApply each time.

Note the ratio of OS memory to R memory:

Before: 1.220429

After: 13.15629 (!)

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))
> res
$before
$before$mem_r
[1] 95.94

$before$mem_os
[1] 117.088

$before$ratio
[1] 1.220429


$perrun
NULL

$gc
$gc$gc_mb
[1] 93.4


$after
$after$mem_r
[1] 124.64

$after$mem_os
[1] 1639.8

$after$ratio
[1] 13.15629


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7460   0.0008
pairlist    592458 793042   0.3386
closure     105104 155110   0.4758
environment 101032 151032   0.4949
promise     205226 305226   0.4873
language     55592  55882   0.0052
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8867   0.0023
logical       9142   9162   0.0022
integer      23109  23112   0.0001
double        2802   2832   0.0107
complex          1      1   0.0000
character   144775 194819   0.3457
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.2991453

$increase_os
[1] 13.00485

I also tried different versions. Well, I tried to try ;-)

From omegahat.org

FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source works fine).

> install.packages("XML", repos="http://www.omegahat.org/R", type="source")
trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
opened URL
downloaded 1.5 Mb

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'

The downloaded source packages are in
  'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
Warning messages:
1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1
2: In install.packages("XML", repos = "http://www.omegahat.org/R", :
  installation of package 'XML' had non-zero exit status

From GitHub

I did not follow the recommendations in the README of the GitHub repo, because it points to a directory containing only a tar.gz of version 3.94-0 (while we are at 3.98-1.1 on CRAN).

Even though it is stated that the GitHub repo is not in standard R package structure, I tried installing from it anyway, and failed ;-)

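3 Answers:

Answer 0 (score: 4)

Although it is still in its infancy (only a few months old!) and has a few quirks, Hadley Wickham has written a library for XML parsing, xml2, which can be found on GitHub. It is limited to reading rather than writing XML, but I have been experimenting with it for parsing XML, and it looks like it will do the job without the memory leaks of the XML package! The functions it provides include:

  • read_xml() to read an XML file
  • xml_children() to get the child nodes of a node
  • xml_text() to get the text within a tag
  • xml_attrs() to get a character vector of a node's attributes and values, which can be coerced to a named list using as.list()

Note that you still need to rm() the XML node objects once you are done with them and force a garbage collection with gc(), but the memory is then actually released to the O/S (disclaimer: only tested on Windows 7, but that seems to be the most "memory-leaky" platform anyway).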
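
A minimal sketch of how these functions fit together, assuming the xml2 package is installed. The XML snippet and the variable names are made up for the example; it shows the calling pattern only and does not demonstrate the memory behaviour itself.

```r
library(xml2)

# Literal XML keeps the example self-contained; read_xml() also accepts a path or URL.
doc <- read_xml("<cars>
  <car name='Mazda RX4' cyl='6'>21.0</car>
  <car name='Datsun 710' cyl='4'>22.8</car>
</cars>")

cars   <- xml_children(doc)               # child nodes of the root element
mpg    <- xml_text(cars)                  # text inside each <car> tag
attrs1 <- as.list(xml_attrs(cars[[1]]))   # attributes of the first node as a named list

# Drop the references and request a collection, as described above.
rm(cars, doc)
invisible(gc())
```

After this, mpg is a plain character vector and attrs1 a plain named list, so neither keeps the parsed document alive.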

Hope this helps someone!

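Answer 1 (score: 1)

Following up on Matthew Wise's answer above about using xml2, I found that the function that actually frees the memory is xml_remove(), followed by gc(), rather than rm().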
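
A sketch of that pattern, assuming a version of xml2 in which xml_remove() has a free argument. The document and variable names are invented for the example, and rm() is still called here only so that no reference to the freed nodes lingers.

```r
library(xml2)

doc   <- read_xml("<root><a>1</a><b>2</b></root>")
nodes <- xml_children(doc)

# Detach the nodes from the document; free = TRUE also releases their memory.
# Freed nodes must not be touched afterwards, so the reference is dropped right away.
xml_remove(nodes, free = TRUE)
rm(nodes)
invisible(gc())

xml_length(doc)   # number of children left under <root>
```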

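Answer 2 (score: 0)

Nothing has happened since I posted the question, so I thought I would draw attention to it again.

Here is a slightly updated version of my investigation.

Preliminaries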

require("rvest")
require("XML")

Functions

getTaskMemoryByPid <- function(
  pid = Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}  

getCurrentMemoryStatus <- function() {
  mem_os  <- getTaskMemoryByPid()
  mem_r   <- memory.size()
  list(r = mem_r, os = mem_os, ratio = mem_os/mem_r)
}

memoryLeak <- function(
  x = system.file("exampleData", "mtcars.xml", package="XML"),
  n = 10000,
  use_text = FALSE,
  xpath = FALSE,
  free_doc = FALSE,
  clean_up = FALSE,
  detailed = FALSE,
  use_rvest = FALSE,
  user_agent = httr::user_agent("Mozilla/5.0")
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  prof_1  <- memory.profile()
  mem_before <- getCurrentMemoryStatus()

  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- if (!use_rvest) {
      xmlParse(x, asText = use_text)
    } else {
      if (file.exists(x)) {
      ## From disk //        
        rvest::html(x)  
      } else {
      ## From web //
        rvest::html_session(x, user_agent)  
      }
    }
    if (xpath) {
      res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile = memory.profile(),
        size = memory.size()
      )
    } 
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  } 

  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
  }

  ## After //
  prof_2  <- memory.profile()
  mem_after <- getCurrentMemoryStatus()

  ## Return value //
  if (detailed) {
    list(
      before = mem_before, 
      perrun = mem_perrun, 
      gc = mem_gc, 
      after = mem_after, 
      comparison_r = data.frame(
        before = prof_1, 
        after = prof_2, 
        increase = round((prof_2/prof_1)-1, 4)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  } else {
    list(
      before_after = data.frame(
        r = c(mem_before$r, mem_after$r),
        os = c(mem_before$os, mem_after$os)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  }
}
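
One caveat about getTaskMemoryByPid(): it strips "." as the thousands separator, which matches the German locale shown in the question's sessionInfo(). On systems where tasklist prints the figure as "51,072 K", the comma survives and as.numeric() returns NA. Below is a sketch of a more tolerant conversion; parseTaskMemory is a hypothetical helper, not part of the original code.

```r
# Convert the memory column of `tasklist /FO csv` to MB.
# Only the digits are kept, so "51.072 K" and "51,072 K" are treated alike.
# Dividing by 1000 mirrors getTaskMemoryByPid() above.
parseTaskMemory <- function(mem) {
  digits <- gsub("[^0-9]", "", mem)
  as.numeric(digits) / 1000
}

parseTaskMemory(c("51.072 K", "51,072 K", "1.639.800 K"))
```

The first two inputs yield the same value, so the same script could be run unchanged on machines with different regional settings.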

Memory status before anything has been requested

getCurrentMemoryStatus()
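
Note that memory.size() only exists on Windows, and from R 4.2.0 onwards it is a stub that no longer reports a usable figure. A rough substitute for the R-side number can be derived from gc(). This is a sketch with a made-up helper name (getRMemoryMb), and the value is not directly comparable to what memory.size() used to return, because gc() only counts memory managed by R's own heap.

```r
# Approximate the memory currently used by R objects, in Mb.
getRMemoryMb <- function() {
  usage <- gc()      # note: calling gc() also triggers a collection
  sum(usage[, 2])    # column 2 holds the Mb in use for Ncells and Vcells
}

getRMemoryMb()
```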

Generate additional offline example content

s <- html_session("http://had.co.nz/")
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "hadley.html")
# html("hadley.html")

s <- html_session(
  "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
  httr::user_agent("Mozilla/5.0"))
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "amazon.html")
# html("amazon.html")

getCurrentMemoryStatus()

Profiling

################
## Mtcars.xml ##
################

res <- memoryLeak(n = 50000, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.1.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.2.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, free_doc = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.3.rdata")
save(res, file = fpath)

###################
## www.had.co.nz ##
###################

## Offline //
res <- memoryLeak(x = "hadley.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://had.co.nz/"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.3.rdata")
save(res, file = fpath)

####################
## www.amazon.com ##
####################

## Offline //
res <- memoryLeak(x = "amazon.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
## File names use the prefix 5.x so the offline results (4.x) are not overwritten
.url <- "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.3.rdata")
save(res, file = fpath)
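
Once the runs have finished, the saved results can be compared side by side. Below is a sketch that assumes each .rdata file holds a single object named res with the elements increase_r and increase_os, as produced by memoryLeak() above; collectIncreases is a hypothetical helper.

```r
# Read saved profiling results and tabulate the relative increases.
collectIncreases <- function(paths) {
  rows <- lapply(paths, function(p) {
    env <- new.env()
    load(p, envir = env)   # restores `res` into a private environment
    data.frame(
      file = basename(p),
      increase_r = env$res$increase_r,
      increase_os = env$res$increase_os
    )
  })
  do.call(rbind, rows)
}

paths <- list.files(tempdir(), pattern = "^memory-profile-.*\\.rdata$",
  full.names = TRUE)
if (length(paths) > 0) print(collectIncreases(paths))
```

Loading into a fresh environment keeps the results from overwriting each other, since every file stores its object under the same name.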