I'm using R to parse some server logs, which produces a character vector like the following:
myLog <- c("[1,2,3]","[4,5,6]","[7,8,9]")
What I'd like to produce from it is a matrix like this:
myMatrix <- matrix(c(c(1,2,3),c(4,5,6),c(7,8,9)),nrow=3,byrow=T)
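The strings come from querying a varchar database field, so I don't think I can use any file-reading tricks.
I tend to have a lot of these, millions of rows at a time.
What I've been doing is the following, and it's slow: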
splitDat <- sapply(inputVector,function(y){
y1 <- gsub("\\[","",y)
y2 <- gsub("\\]","",y1)
y3 <- strsplit(y2,split=", ")
y4 <- unlist(y3)
})
Is there a more efficient way? A one-line regular expression?
Answer 0: (score: 8)
You can try the stringi package:
library(stringi)
matrix(as.numeric(unlist(stri_extract_all_regex(myLog, pattern = "\\d"))),
nrow = 3, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9
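One caveat with the pattern above: `"\\d"` matches a single digit, so it works for the sample data but would split a value such as 10 into 1 and 0. The following is a minimal base R sketch of the same idea that keeps runs of digits together. The `myLog2` values are made up for illustration and are not from the question; with stringi, the equivalent change would presumably be `pattern = "\\d+"`.

```r
# Hypothetical log entries containing multi-digit values (not from the question)
myLog2 <- c("[10,2,3]", "[4,50,6]")

# "[0-9]+" matches each whole run of digits, so 10 and 50 stay intact
tokens <- regmatches(myLog2, gregexpr("[0-9]+", myLog2))

# One row per log entry, regardless of how many entries there are
m <- matrix(as.numeric(unlist(tokens)), nrow = length(myLog2), byrow = TRUE)
m
```

Using `nrow = length(myLog2)` rather than a fixed number keeps the result at one row per entry as the input grows.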
Benchmark
library(stringi)
library(gsubfn)
library(microbenchmark)
set.seed(123)
myLog <- c("[1,2,3]","[4,5,6]","[7,8,9]")
myLog <- sample(myLog, 1e4, replace = TRUE)
Res <- microbenchmark(
David = matrix(as.numeric(unlist(stri_extract_all_regex(myLog, pattern = "\\d"))), nrow = 3, byrow = TRUE),
Thela = matrix(as.numeric(unlist(strsplit(myLog,"\\[|\\]|,"))),nrow=length(myLog),byrow=TRUE)[,-1],
BD1 = matrix(as.numeric(scan(text=gsub("\\D"," ",myLog),what="")), nrow=length(myLog),byrow=T),
BD2 = matrix(as.numeric(scan(text=gsub("[],[]"," ",myLog), what="")),nrow=length(myLog), byrow=T),
GG1 = read.table(text = gsub("\\D", " ", myLog)),
GG2 = read.pattern(text = myLog, pat = "\\d")
)
Res
# Unit: milliseconds
# expr min lq mean median uq max neval
# David 12.01351 12.90111 16.41127 13.98826 15.62786 101.65117 100
# Thela 25.49944 27.09937 29.83234 28.32153 30.24141 80.79836 100
# BD1 92.39541 94.81445 101.20524 98.07333 102.41877 172.60835 100
# BD2 91.91578 94.66958 104.02773 96.94019 103.99383 206.37865 100
# GG1 91.28813 94.29219 98.63825 96.57544 100.57172 140.97998 100
# GG2 470.43382 514.58552 551.94922 540.86479 570.88711 815.75789 100
boxplot(Res)
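A note on the shape of the results: the `David` entry in the benchmark keeps `nrow = 3` from the three-element example, so on a longer input it returns a matrix with 3 rows rather than one row per log entry. This should not change the timings much, but it matters if the result is used. The sketch below illustrates the difference in base R; `sampleLog`, `wide` and `tall` are names introduced here for illustration only.

```r
set.seed(123)
# A small sample of 12 entries, drawn the same way as in the benchmark
sampleLog <- sample(c("[1,2,3]", "[4,5,6]", "[7,8,9]"), 12, replace = TRUE)

# Extract every single digit (sufficient for this sample data)
digits <- as.numeric(unlist(regmatches(sampleLog, gregexpr("[0-9]", sampleLog))))

# Fixed nrow = 3: the 36 values are laid out as 3 rows of 12
wide <- matrix(digits, nrow = 3, byrow = TRUE)

# nrow = length(sampleLog): one row of 3 values per log entry
tall <- matrix(digits, nrow = length(sampleLog), byrow = TRUE)

dim(wide)
dim(tall)
```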
Answer 1: (score: 6)
1) I haven't checked how fast it is, but the code is short:
library(gsubfn)
read.pattern(text = myLog, pat = "\\d")
where myLog is the same as in the question.
2) Here is a base R solution:
read.table(text = gsub("\\D", " ", myLog))
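Note that `read.table` returns a data frame, whereas the question asks for a matrix. A minimal sketch of the extra conversion step, assuming the default column names `V1`, `V2`, `V3` are not wanted:

```r
myLog <- c("[1,2,3]", "[4,5,6]", "[7,8,9]")

# Replace every non-digit with a space, then parse as whitespace-separated text
df <- read.table(text = gsub("\\D", " ", myLog))

# Convert the data frame to a plain matrix and drop the V1..V3 column names
m <- as.matrix(df)
dimnames(m) <- NULL
m
```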
Answer 2: (score: 4)
This seems fast (about 2 seconds on a million cases), but not as fast as David's stringi solution:
matrix(as.numeric(unlist(strsplit(myLog,"\\[|\\]|,"))),nrow=length(myLog),
byrow=TRUE)[,-1]
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9
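The trailing `[,-1]` is there because splitting on the opening bracket leaves an empty string at the start of each entry, which `as.numeric` turns into `NA`. A small sketch of what happens to a single entry:

```r
# Split one entry on "[", "]" or ","
pieces <- strsplit("[1,2,3]", "\\[|\\]|,")[[1]]
pieces
# [1] ""  "1" "2" "3"

# The leading empty string becomes NA, hence the extra first column to drop
as.numeric(pieces)
# [1] NA  1  2  3
```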
Benchmarking on 30K cases (all approaches except the first two made my R session unresponsive when tested on 1 million cases):
myLog <- c("[1,2,3]","[4,5,6]","[7,8,9]")
myLog <- sample(myLog, 30000,replace=TRUE)
The two fastest:
library(stringi)
system.time(
matrix(as.numeric(unlist(stri_extract_all_regex(myLog, pattern = "\\d"))),
nrow = 3, byrow = TRUE)
)
# user system elapsed
# 0.03 0.00 0.03
system.time(
matrix(as.numeric(unlist(strsplit(myLog,"\\[|\\]|,"))),nrow=length(myLog),
byrow=TRUE)[,-1]
)
# user system elapsed
# 0.05 0.00 0.04
Middle of the pack:
system.time(
matrix(as.numeric(scan(text=gsub("\\D"," ",myLog),what="")),
nrow=length(myLog),byrow=T)
)
#Read 90000 items
# user system elapsed
# 0.57 0.00 0.58
system.time(
matrix(as.numeric(scan(text=gsub("[],[]"," ",myLog), what="")),
nrow=length(myLog), byrow=T)
)
#Read 90000 items
# user system elapsed
# 0.59 0.00 0.59
system.time(
read.table(text = gsub("\\D", " ", myLog))
)
# user system elapsed
# 0.59 0.00 0.60
Slowest:
library(gsubfn)
system.time(
read.pattern(text = myLog, pat = "\\d")
)
# user system elapsed
# 1.79 0.00 1.79
Answer 3: (score: 4)
myMatrix <- matrix(as.numeric(scan(text=gsub("[],[]"," ",myLog),
what="")),
nrow=length(myLog), byrow=T)
#Read 9 items
myMatrix
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9
Seeing G_G's pattern made me realize that the negated digit class `\\D` can be used in the gsub call:
myMatrix <- matrix(as.numeric(scan(text=gsub("\\D"," ",myLog),what="")),nrow=length(myLog),byrow=T)