How to process many files without running out of memory in R

Asked: 2016-06-20 13:20:33

Tags: r

I am trying to parse thousands of files in a directory. I wrote this script to loop over each file, but my system runs out of memory. How else can I process these files?

library(XML)
library(data.table)

dir_path <- c("C:/Documents/Data")
input_files <- list.files(path = dir_path, pattern = "htm", full.names = TRUE)
nn = length(input_files)
total_data = data.table(server = as.character(), time = as.character())
for(i in 1:nn)
  {
     # parse one file and pull out the server name and current time
     xmlobj = xmlTreeParse(file = input_files[i], isHTML = T)
     r = xmlRoot(xmlobj)
     server = xmlValue(r[[2]][1][1]$h1)
     time = xmlValue(r[[2]][4][1]$dl[1]$dt)
     # grow the result table by one row per file
     total_data <- rbind(total_data, data.frame(server, time))
     gc()
  }

Each file I am reading is in .htm format and has this content:

Apache Server Status for webserver101

Server Version: IBM_HTTP_Server/7.0.0.39 (Unix)
Server Built: Aug 3 2015 17:29:08 

Current Time: Sunday, 05-Jun-2016 13:56:27 EDT
Restart Time: Saturday, 04-Jun-2016 23:06:02 EDT
Parent Server Generation: 0
Server uptime: 14 hours 50 minutes 24 seconds
Total accesses: 39855 - Total Traffic: 1.2 GB
CPU Usage: u814.13 s13.33 cu0 cs0 - 1.55% CPU load
.746 requests/sec - 24.2 kB/second - 32.5 kB/request
7 requests currently being processed, 73 idle workers

1 Answer:

Answer 0 (score: 2)

You can try importing all the files into a list and then processing the list. Building the result once with do.call(rbind, ...) avoids the repeated copying of a growing table that exhausts memory in your loop:

all_files <- lapply(input_files, xmlTreeParse, isHTML=TRUE)
process_files <- lapply(all_files, function(myfile){
    # extract the server name and current time from one parsed document
    r = xmlRoot(myfile)
    server = xmlValue(r[[2]][1][1]$h1)
    time = xmlValue(r[[2]][4][1]$dl[1]$dt)
    web_data = data.frame(server, time, stringsAsFactors=FALSE)
    web_data
})
total_data <- do.call(rbind, process_files)
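
Note that this still parses every file before combining, so the list of parsed documents itself can be large. If that is still too much for your machine, splitting the work into chunks as below keeps only one chunk's worth of parse trees in memory at a time.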

If you need to split the data into chunks, you can use the seq function to get the start index of each chunk:

seq_ind <- seq(1, length(input_files), by=1000)
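
For example, with 2,500 input files this gives the start indices 1, 1001 and 2001.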

You can then get the list of files corresponding to each chunk using:
files_in_chunks <- mapply(function(x, y) input_files[x:y], x=seq_ind, y=c(seq_ind[-1] - 1, length(input_files)), SIMPLIFY=FALSE)
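
From there, here is a minimal sketch of processing one chunk at a time, reusing the extraction logic from above. The per-chunk rm()/gc() is an assumption about where to release the parse trees, not part of the original answer:

chunk_results <- lapply(files_in_chunks, function(chunk) {
    # parse only this chunk's files
    docs <- lapply(chunk, xmlTreeParse, isHTML = TRUE)
    rows <- lapply(docs, function(doc) {
        r <- xmlRoot(doc)
        data.frame(server = xmlValue(r[[2]][1][1]$h1),
                   time   = xmlValue(r[[2]][4][1]$dl[1]$dt),
                   stringsAsFactors = FALSE)
    })
    out <- do.call(rbind, rows)
    rm(docs)  # assumption: drop this chunk's parse trees before the next chunk
    gc()
    out
})
total_data <- do.call(rbind, chunk_results)

This way the peak memory use is roughly one chunk of parsed documents plus the accumulated rows, instead of all files at once.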