I know the basics of using regular expressions in R. But here I have a file like this:
[2016-04-28 14:00:06,603] ,,,,, SERVICE_ID = 441,DEBUG,DBSEntryServlet,DBSEntryServlet:delegateToRequestManager :: SERVICE_ID = 541,SERVICE_ID = 9981
[2016-04-28 14:00:06,608] ,,,,,, DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID = 00234,SERVICE_ID = 11134,IMD = 6767
I want to extract the timestamp together with every SERVICE_ID on that line.
So my expected output is:
[2016-04-28 14:00:06,603] SERVICE_ID = 441 SERVICE_ID = 541 SERVICE_ID = 9981
[2016-04-28 14:00:06,608] SERVICE_ID = 00234 SERVICE_ID = 11134
The code I tried extracts only one SERVICE_ID.
library(qdapRegex)
a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")
testi <- rm_between(a,"SERVICE_ID",",",extract = T)
Answer 0 (score: 0)
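We replace each run of two or more commas with " " to get 'str2'. Then, using a regex lookbehind, we match one or more whitespace characters (\\s+) that follow a ], followed by any characters (.*) up to the end of the string, and replace the match with "". This leaves only the [2016-04..,03] part, stored as 'str3'. From 'str2', we extract every substring "SERVICE_ID=" followed by digits (\\d+) into a list, paste the matches for each line together, and finally paste the result onto 'str3'.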
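The steps above can also be sketched in base R alone. This is a minimal sketch, not the answerer's method: the helper name extract_service_ids is hypothetical, and it assumes the timestamp is the first [...] group on each line. Unlike the stringr version, it allows optional spaces around =, since the sample in the question is written as SERVICE_ID = 441.

```r
# Hypothetical helper (not part of the original answers): returns one string
# per input line, made of the timestamp followed by every SERVICE_ID found.
extract_service_ids <- function(lines) {
  # Timestamp: first "[...]" group on each line (NA when a line has none)
  m <- regexpr("\\[[^]]*\\]", lines, perl = TRUE)
  stamp <- rep(NA_character_, length(lines))
  stamp[m > 0] <- regmatches(lines, m)

  # Every "SERVICE_ID = <digits>", with optional whitespace around "="
  ids <- regmatches(lines, gregexpr("SERVICE_ID\\s*=\\s*\\d+", lines, perl = TRUE))
  ids <- vapply(ids, paste, character(1), collapse = " ")

  paste(stamp, ids)
}

# The two sample lines as they appear in the question
sample_lines <- c(
  "[2016-04-28 14:00:06,603] ,,,,, SERVICE_ID = 441,DEBUG,DBSEntryServlet,DBSEntryServlet:delegateToRequestManager :: SERVICE_ID = 541,SERVICE_ID = 9981",
  "[2016-04-28 14:00:06,608] ,,,,,, DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID = 00234,SERVICE_ID = 11134,IMD = 6767"
)
result <- extract_service_ids(sample_lines)
print(result)
```

Because the IDs are matched directly, the comma-collapsing step is not needed here. To run it on a file, the character vector returned by readLines() can be passed in place of sample_lines.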
library(stringr)
# Sample data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
# Collapse each run of two or more commas into a single space
str2 <- gsub(",{2,}", " ", str1)
# Drop everything after the closing "]" so only the timestamp remains
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
# Gather every SERVICE_ID=<digits> per line and append it to the timestamp
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
Answer 1 (score: 0)
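library(stringr)
# Sample data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
# Collapse each run of two or more commas into a single space
str2 <- gsub(",{2,}", " ", str1)
# Keep the text before the closing "]", then strip the opening "["
str4 <- sub("\\].*", "", str2, perl = TRUE)
str5 <- sub("\\[", "", str4, perl = TRUE)
# Join all SERVICE_ID=<digits> matches on each line with spaces
service_ids <- sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), function(x) paste(x, collapse = " "))
# Two-column character matrix: timestamp and its SERVICE_IDs
net <- cbind(str5, service_ids)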
Output: