I have a directory, c:/logs, full of large log files. I need to read each log, grep for a few regular expressions, and put the output into a data frame. This is what I have, and it takes a very long time to run:
final <- data.frame()
path <- "C:/logs"
logs <- dir(path, pattern = "log", full.names = TRUE, ignore.case = TRUE)
toMatch <- c("compile.cancelSwapCode", "compile.insertCode", "compile.getCode",
             "compile.getCodes", "compile.getCodeWithAnalysis")
for (i in logs) {
  print(i)
  xx <- readLines(i)
  xxx <- grep(paste(toMatch, collapse = "|"), xx, value = TRUE)
  df <- read.table(text = xxx)
  final <- rbind(final, df)
}
The actual file sizes are around 150 MB. Is there a faster way to do this?
Sample log file:
2016-11-02 00:00:01,506 INFO [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - Received request for test request: [ticket=test101]
2016-11-02 00:00:01,514 INFO [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - request: [ticket=test101] found in cache, returning from Cache
2016-11-02 00:00:01,515 DEBUG [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - compileController.Code finished in 9ms
2016-11-02 00:00:01,578 INFO [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - Received request for test request: [ticket=test101]
2016-11-02 00:00:01,582 INFO [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - request: [ticket=test101] found in cache, returning from Cache
2016-11-02 00:00:01,582 DEBUG [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - compileController.Code finished in 4ms
2016-11-02 00:00:08,606 INFO [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - Received request for test request: [ticket=test102]
2016-11-02 00:00:08,606 INFO [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - request: [ticket=test102] found in cache, returning from Cache
2016-11-02 00:00:08,606 DEBUG [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - compileController.Code finished in 0ms
2016-11-02 00:00:09,320 INFO [[(JSK) mux request dispatch][/][tid=566][compileController.Code]] - Received request for test request: [ticket=test102]
2016-11-02 00:00:09,320 INFO [[(JSK) mux request dispatch][/][tid=566][compileController.Code]] - request: [ticket=test102] found in cache, returning from Cache
Answer (score: 2)
The main bottleneck in your code appears to be the call to read.table. By first concatenating the vector xxx into a single newline-delimited string, you can pass it to data.table's fread, which is much faster. Additionally, I replaced the call to grep with stri_subset from the stringi package, and used data.table's rbindlist to combine everything, rather than calling rbind iteratively. I'm not sure how many files you have in total, but in the code below I used nine 180 MB files created from your example.
library(data.table)
library(stringi)
files <- list.files(
  "/tmp",
  pattern = "logfile",
  full.names = TRUE
)

re <- paste(
  c("compile.cancelSwapCode", "compile.insertCode",
    "compile.getCode", "compile.getCodes",
    "compile.getCodeWithAnalysis", "DEBUG"),
  collapse = "|"
)

system.time({
  final <- rbindlist(
    lapply(files, function(f) {
      lines <- readLines(f)
      fread(
        paste0(
          stri_subset(lines, regex = re),
          collapse = "\n"
        ),
        header = FALSE,
        sep = " ",
        colClasses = "character"
      )
    })
  )
})
# user system elapsed
# 136.664 0.628 137.378
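To see the fread-on-a-string idea in isolation, here is a minimal sketch using two inline sample lines rather than the author's files; fread parses the newline-joined string directly, splitting each line on spaces:

```r
library(data.table)

# Two space-delimited sample lines, joined into one newline-separated string
txt <- c("2016-11-02 00:00:01,506 INFO alpha",
         "2016-11-02 00:00:01,514 INFO beta")
dt <- fread(paste0(txt, collapse = "\n"),
            header = FALSE, sep = " ", colClasses = "character")
dt$V1  # first whitespace-delimited field of each line: the dates
```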
It looks like you're on Windows, so if you have Rtools installed and are able to take advantage of Rcpp, you can save even more time by replacing readLines with a simple C++ equivalent:
#include <Rcpp.h>
#include <fstream>
// [[Rcpp::export]]
Rcpp::CharacterVector read_lines(const char* file) {
  std::vector<std::string> res;
  res.reserve(10000);
  std::ifstream stream(file);
  std::string tmp;
  while (std::getline(stream, tmp)) {
    res.push_back(tmp);
  }
  return Rcpp::wrap(res);
}
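If saved to a file, the snippet above can be compiled and loaded from R with Rcpp::sourceCpp; the file name and log path below are just assumptions for illustration:

```r
# Assumes the C++ code above was saved as read_lines.cpp in the working directory
Rcpp::sourceCpp("read_lines.cpp")           # compiles and exports read_lines() into R
lines <- read_lines("C:/logs/example.log")  # hypothetical log file path
```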
Replacing readLines with read_lines in the code above, I get:
# user system elapsed
# 39.200 0.376 39.596
If you can't get Rtools and Rcpp up and running, using read_lines from the readr package will still be faster than readLines. Here is a comparison of all three on the 180 MB files:
system.time({ lapply(files, readLines) })
# user system elapsed
# 287.136 1.140 288.471
system.time({ lapply(files, readr::read_lines) })
# user system elapsed
# 91.568 0.604 91.895
system.time({ lapply(files, read_lines) })
# user system elapsed
# 24.204 0.652 24.862
Update
Regarding your comment below, if I understand correctly, you can do this:
final <- rbindlist(
  lapply(files, function(f) {
    lines <- readLines(f)
    matches <- stri_subset(lines, regex = re)
    data.table(
      stri_match_first(
        matches,
        regex = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"
      )
    )
  })
)[, .(Count = .N), by = "V1"]
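As a quick check of the timestamp regex used above, applied to one of the sample lines from the question (stri_match_first returns a character matrix with the full match in its first column):

```r
library(stringi)

line <- "2016-11-02 00:00:01,506 INFO [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - Received request for test request: [ticket=test101]"
m <- stri_match_first(line, regex = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}")
m[1, 1]  # "2016-11-02 00:00:01"
```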