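In R, I want to parse a series of log data into categorised events.
I have a vector called regex_text that holds one continuous string (line breaks added here for clarity):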
21/08/2014 22:58CONTENT_ACCESS.preparing
21/08/2014 23:00EXE_IN.preparing
21/08/2014 23:07CONTENT_ACCESS.preparing
21/08/2014 23:08CONTENT_ACCESS.preparing
21/08/2014 23:12EXE_CO.preparing
21/08/2014 23:28EXE_IN.preparing
21/08/2014 23:29CONTENT_ACCESS.preparing
21/08/2014 23:30CONTENT_ACCESS.preparing
I would like to use a regular expression to get the first and last timestamp of each run of "CONTENT_ACCESS.preparing" and put them into this data frame:
          start_ts          stop_ts
1 21/08/2014 22:58 21/08/2014 22:58
2 21/08/2014 23:07 21/08/2014 23:08
3 21/08/2014 23:29 21/08/2014 23:30
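There may be many repetitions of "CONTENT_ACCESS.preparing"; my example only has two instances, each with two entries.
The code below can be run directly in R and currently outputs a single entry: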
start_ts stop_ts
1 21/08/2014 23:07 21/08/2014 23:30
I would appreciate guidance on how to extract both entries, as described above.
Code:
library(stringr)
options(stringsAsFactors = FALSE)
eventised_session <- data.frame(start_ts=as.character(),
                                stop_ts=as.character())
regex_text <- "21/08/2014 22:58CONTENT_ACCESS.preparing21/08/2014 23:00EXE_IN.preparing21/08/2014 23:07CONTENT_ACCESS.preparing21/08/2014 23:08CONTENT_ACCESS.preparing21/08/2014 23:12EXE_CO.preparing21/08/2014 23:28EXE_IN.preparing21/08/2014 23:29CONTENT_ACCESS.preparing21/08/2014 23:30CONTENT_ACCESS.preparing"
regex_pattern <- "(\\d{2}\\/\\d{2}\\/\\d{4}\\s\\d{2}\\:\\d{2})(CONTENT_ACCESS\\.preparing)"
if (grepl(regex_pattern, regex_text, perl=TRUE)) {
  sm <- str_match_all(regex_text, regex_pattern)
  # Get the first and last timestamp in matched sequence
  r_start_ts <- sapply(sm, function(x) x[1, 2])
  r_stop_ts <- sapply(sm, function(x) x[sapply(sm,nrow), sapply(sm, ncol) - 1])
  eventised_session[nrow(eventised_session)+1,] <- c(r_start_ts, r_stop_ts)
  print(eventised_session)
}
Answer 0 (score: 1)
Are you looking for something like this?
vec = c("21/08/2014 23:07CONTENT_ACCESS.preparing", "21/08/2014 23:08CONTENT_ACCESS.preparing", "21/08/2014 23:12EXE_CO.preparing", "21/08/2014 23:28EXE_IN.preparing", "21/08/2014 23:29CONTENT_ACCESS.preparing", "21/08/2014 23:30CONTENT_ACCESS.preparing")
setNames(data.frame(gsub("[a-z].*","",t(matrix(grep("CONTENT_ACCESS.preparing",vec,value=T),2)),T)),c("Start_ts","Stop_ts"))
Start_ts Stop_ts
1 21/08/2014 23:07 21/08/2014 23:08
2 21/08/2014 23:29 21/08/2014 23:30
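The vec above is typed out by hand, whereas the question stores everything in one string. As a minimal sketch, and assuming every entry has the form timestamp + upper-case event name + ".preparing" (the name entry_pattern is only illustrative), the string could be split into such a vector with base R:

```r
# The same six entries as in vec, glued into one continuous string
regex_text <- paste0(
  "21/08/2014 23:07CONTENT_ACCESS.preparing",
  "21/08/2014 23:08CONTENT_ACCESS.preparing",
  "21/08/2014 23:12EXE_CO.preparing",
  "21/08/2014 23:28EXE_IN.preparing",
  "21/08/2014 23:29CONTENT_ACCESS.preparing",
  "21/08/2014 23:30CONTENT_ACCESS.preparing"
)

# Assumed entry format: dd/mm/yyyy hh:mm, then an upper-case event name, then ".preparing"
entry_pattern <- "\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}[A-Z_]+\\.preparing"

# gregexpr() locates every entry; regmatches() returns them as a character vector
vec <- regmatches(regex_text, gregexpr(entry_pattern, regex_text, perl = TRUE))[[1]]
print(vec)
```

Note that the matrix(..., 2) step in the answer still relies on every run containing exactly two entries.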
If you have regex_text as in your code, you can do the following:
regex_text <- "21/08/2014 23:07CONTENT_ACCESS.preparing21/08/2014 23:08CONTENT_ACCESS.preparing21/08/2014 23:12EXE_CO.preparing21/08/2014 23:28EXE_IN.preparing21/08/2014 23:29CONTENT_ACCESS.preparing21/08/2014 23:30CONTENT_ACCESS.preparing"
a = gsub(".*?(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2})CONTENT_ACCESS[.]preparing.*?","\\1\n",regex_text)
setNames(data.frame(matrix( unlist(strsplit(a,"\n")),ncol = 2,byrow = T)),c("start_ts","stop_ts"))
start_ts stop_ts
1 21/08/2014 23:07 21/08/2014 23:08
2 21/08/2014 23:29 21/08/2014 23:30
With the full regex_text from the question, which also contains the single 22:58 entry, the following works:
a = gsub("((.*?preparing){2})","\\1\n ",regex_text)
b = read.table(text=gsub("(?<=preparing)(?=\\d+)","|",a,perl=T),sep="|",fill=T,h=F)
d = sub("^(?:(?!CONTENT).)*$|(^.*)CONTENT.*$","\\1",as.matrix(b),perl=T)
subset(data.frame(start_ts = d[,1],stop_ts = ifelse(d[,2]=="",d[,1],d[,2])),start_ts!="")
start_ts stop_ts
1 21/08/2014 22:58 21/08/2014 22:58
2 21/08/2014 23:07 21/08/2014 23:08
4 21/08/2014 23:29 21/08/2014 23:30
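The snippets above depend on the entries pairing up neatly. As an alternative sketch (not part of the original answer), runs of any length could be found with run-length encoding via base R's rle(). It assumes the same entry format as the question; the names entry_pattern, stamp, event and sessions are illustrative.

```r
regex_text <- "21/08/2014 22:58CONTENT_ACCESS.preparing21/08/2014 23:00EXE_IN.preparing21/08/2014 23:07CONTENT_ACCESS.preparing21/08/2014 23:08CONTENT_ACCESS.preparing21/08/2014 23:12EXE_CO.preparing21/08/2014 23:28EXE_IN.preparing21/08/2014 23:29CONTENT_ACCESS.preparing21/08/2014 23:30CONTENT_ACCESS.preparing"

# Group 1 = timestamp, group 2 = event name (assumed to be upper-case letters and underscores)
entry_pattern <- "(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2})([A-Z_]+\\.preparing)"

# Split the continuous string into one element per log entry
entries <- regmatches(regex_text, gregexpr(entry_pattern, regex_text, perl = TRUE))[[1]]
stamp <- sub(entry_pattern, "\\1", entries, perl = TRUE)
event <- sub(entry_pattern, "\\2", entries, perl = TRUE)

# Run-length encode the "is this a CONTENT_ACCESS entry?" flag
runs <- rle(event == "CONTENT_ACCESS.preparing")
run_end <- cumsum(runs$lengths)          # index of the last entry in each run
run_start <- run_end - runs$lengths + 1  # index of the first entry in each run

# Keep only the runs whose flag is TRUE
sessions <- data.frame(
  start_ts = stamp[run_start[runs$values]],
  stop_ts = stamp[run_end[runs$values]],
  stringsAsFactors = FALSE
)
print(sessions)
```

For the question's input this should yield three rows (22:58 to 22:58, 23:07 to 23:08, 23:29 to 23:30), and the row names stay consecutive because nothing is filtered out afterwards.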
Answer 1 (score: 0)
You can build a list of the match instances you are looking for and loop over it:
for (match in match_list) {
  # match as in your code
  # add into db as in your code
}
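One possible way to fill in the loop above, offered as a sketch rather than as what the answerer had in mind: build match_list with a pattern that matches an uninterrupted run of CONTENT_ACCESS.preparing entries, then take the first and last timestamp of each run. The names run_pattern, ts_pattern and run_text are illustrative assumptions.

```r
regex_text <- "21/08/2014 22:58CONTENT_ACCESS.preparing21/08/2014 23:00EXE_IN.preparing21/08/2014 23:07CONTENT_ACCESS.preparing21/08/2014 23:08CONTENT_ACCESS.preparing21/08/2014 23:12EXE_CO.preparing21/08/2014 23:28EXE_IN.preparing21/08/2014 23:29CONTENT_ACCESS.preparing21/08/2014 23:30CONTENT_ACCESS.preparing"

ts_pattern <- "\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}"
# One or more consecutive "timestamp + CONTENT_ACCESS.preparing" entries
run_pattern <- paste0("(?:", ts_pattern, "CONTENT_ACCESS\\.preparing)+")

# match_list: one string per uninterrupted run of CONTENT_ACCESS.preparing entries
match_list <- regmatches(regex_text, gregexpr(run_pattern, regex_text, perl = TRUE))[[1]]

eventised_session <- data.frame(start_ts = character(),
                                stop_ts = character(),
                                stringsAsFactors = FALSE)

for (run_text in match_list) {
  # All timestamps inside this run, in order of appearance
  stamps <- regmatches(run_text, gregexpr(ts_pattern, run_text, perl = TRUE))[[1]]
  # Append the first and last timestamp as a new row, as in the question's code
  eventised_session[nrow(eventised_session) + 1, ] <- c(stamps[1], stamps[length(stamps)])
}
print(eventised_session)
```

Growing a data frame row by row is fine for a handful of runs; for long logs it would likely be faster to collect the rows in a list and bind them once at the end.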