I'm trying to parse thousands of files in a directory. I wrote this script to read each file, but my system runs out of memory. How else can I process these files?
library(XML)
library(data.table)

dir_path <- "C:/Documents/Data"
input_files <- list.files(path = dir_path, pattern = "htm", full.names = TRUE)
nn <- length(input_files)
total_data <- data.table(server = character(), time = character())
for (i in 1:nn)
{
  # Parse one HTML status page and extract the server name and current time
  xmlobj <- xmlTreeParse(file = input_files[i], isHTML = TRUE)
  r <- xmlRoot(xmlobj)
  server <- xmlValue(r[[2]][1][1]$h1)
  time <- xmlValue(r[[2]][4][1]$dl[1]$dt)
  web_data <- data.frame(server, time, stringsAsFactors = FALSE)
  # Growing total_data with rbind() inside the loop copies it on every iteration
  total_data <- rbind(total_data, web_data)
  gc()
}
Each file I'm reading is a .htm file with this kind of content:
Apache Server Status for webserver101
Server Version: IBM_HTTP_Server/7.0.0.39 (Unix)
Server Built: Aug 3 2015 17:29:08
Current Time: Sunday, 05-Jun-2016 13:56:27 EDT
Restart Time: Saturday, 04-Jun-2016 23:06:02 EDT
Parent Server Generation: 0
Server uptime: 14 hours 50 minutes 24 seconds
Total accesses: 39855 - Total Traffic: 1.2 GB
CPU Usage: u814.13 s13.33 cu0 cs0 - 1.55% CPU load
.746 requests/sec - 24.2 kB/second - 32.5 kB/request
7 requests currently being processed, 73 idle workers
Answer 0 (score: 2)
You can try importing all the files into a list and then processing the list:
all_files <- lapply(input_files, xmlTreeParse, isHTML = TRUE)
process_files <- lapply(all_files, function(myfile) {
  r <- xmlRoot(myfile)
  server <- xmlValue(r[[2]][1][1]$h1)
  time <- xmlValue(r[[2]][4][1]$dl[1]$dt)
  web_data <- data.frame(server, time, stringsAsFactors = FALSE)
  web_data
})
total_data <- do.call(rbind, process_files)
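As an optional variation (not part of the original answer), since data.table is already loaded, rbindlist() can stack the list in a single pass and is usually faster and lighter on memory than do.call(rbind, ...):

# rbindlist() binds a list of data.frames without the repeated
# copying that do.call(rbind, ...) can trigger
total_data <- data.table::rbindlist(process_files)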
If you need to split the work into chunks, you can use the seq function to get the starting index of each chunk:
seq_ind <- seq(1, length(input_files), by=1000)
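For example, with 3,500 input files this gives seq_ind equal to c(1, 1001, 2001, 3001), i.e. one starting index per chunk of up to 1,000 files.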
Then you can use mapply to get the list of files corresponding to each chunk (note the - 1 on the end indices so that consecutive chunks do not share a boundary file):

files_in_chunks <- mapply(function(x, y) input_files[x:y], x = seq_ind, y = c(seq_ind[-1] - 1, length(input_files)), SIMPLIFY = FALSE)