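I have a string like this:
Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m.
I need to "parse" this string using some rules:
The first block starts with Received and ends with a date and time.
Each following block starts with Changed status: and ends with a date and time.
There can be any number of Changed status: blocks (at least 1), and the status can vary.
What I need to do is:
Step 1: split the string into its blocks.
Step 2: split each block into its parts.
Example of step 1:
[Received @ 10/10/2014 02:29:55 a.m.], [Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.], [Changed status: 'processed' @ 10/10/2014 02:40:24 a.m.]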
For the example above, what I need is something like this:
Received       | NULL       | 10/10/2014 02:29:55 am
Changed status | processing | 10/10/2014 02:40:20 am
Changed status | processed  | 10/10/2014 02:40:24 am
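I think the second step is easy (each block can be split using @ and : as separators), but the first step is making me pull my hair out. Is there a way to do this kind of thing with regular expressions?
I have tried a few things (like Received|Changed.*[ap].m.), but they don't work (evaluating the regular expression always returns the whole string).
I would like to do this in R. R has built-in support for regular expressions, so that is how I have been trying to approach the problem.
Any help will be appreciated. Honestly, I'm lost here (but I'll keep trying... if I find steps that get me closer to a solution, I'll edit my post).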
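Answer 0: (score: 3)
Here's one possibility that you could put into a function. In the string you posted, the important pieces of information appear to be separated by two spaces, which is convenient. Basically, all I did was make sure every block splits into the same number of pieces.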
x <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."
# Drop quotes and periods, then split into blocks on the double space
s <- strsplit(gsub("['.]", "", x), "  ")[[1]]
# Give the "Received" block an empty status field (note the two spaces after the colon)
g <- grep("Received", s)
s[g] <- sub("(\\D) ", "\\1:  ", s[g])
do.call(rbind, strsplit(s, " @ |: "))
#      [,1]             [,2]         [,3]
# [1,] "Received"       ""           "10/10/2014 02:29:55 am"
# [2,] "Changed status" "processing" "10/10/2014 02:40:20 am"
# [3,] "Changed status" "processed"  "10/10/2014 02:40:24 am"
I didn't use "NULL" because I assume you meant that you want an empty string. In any case, NULL would not show up in a data frame.
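As a rough sketch of what "putting this into a function" could look like: the name parse_log and the column names below are made up for illustration, and the code assumes the blocks are always separated by exactly two spaces. Empty statuses are turned into NA, which is the usual way to mark a missing value in a data frame.

```r
# Hypothetical helper; assumes blocks are separated by exactly two spaces.
parse_log <- function(x) {
  # Drop quotes and periods, then split into blocks on the double space
  s <- strsplit(gsub("['.]", "", x), "  ", fixed = TRUE)[[1]]
  # Give the "Received" block an empty status field so every block has 3 parts
  s <- sub("^Received @ ", "Received:  @ ", s)
  m <- do.call(rbind, strsplit(s, " @ |: "))
  out <- data.frame(event = m[, 1], status = m[, 2], time = m[, 3],
                    stringsAsFactors = FALSE)
  out$status[out$status == ""] <- NA  # NA marks "no status"
  out
}

x <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."
res <- parse_log(x)
res
```

If the separator is not reliably two spaces, splitting on the keywords instead of on whitespace is probably the safer route.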
Answer 1: (score: 3)
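Here is a short solution based on strapplyc. strapplyc matches the regular expression against the input string s and extracts the matches to the parenthesized portions of the regular expression, except for the non-capturing (?:...) group.
There are 3 pairs of capturing parentheses in pat. The first matches Received or Changed status. Then we optionally match a colon, a space, a single quote, zero or more non-single-quote characters, and another quote. The part between the two quotes is the second captured string. Then we match a space, @, a space, and the date/time string. The date/time string is the third capture.
Finally, matrix is used to reshape the result into 3 columns: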
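library(gsubfn)
# s is the input string from the question
s <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."
pat <- "(Received|Changed status)(?:: '([^']*)')? @ (../../.... ..:..:.. ....)"
matrix(strapplyc(s, pat, simplify = TRUE), ncol = 3, byrow = TRUE)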
giving:
     [,1]             [,2]         [,3]
[1,] "Received"       ""           "10/10/2014 02:29:55 a.m."
[2,] "Changed status" "processing" "10/10/2014 02:40:20 a.m."
[3,] "Changed status" "processed"  "10/10/2014 02:40:24 a.m."
Update: Simplified. Also modified the output to match the question.
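For anyone who would rather not add a package, the same pattern can probably be used with base R alone. This is only a sketch, not a drop-in replacement for strapplyc: it first pulls out each block with gregexpr/regmatches, then uses sub with a backreference to keep one capture group at a time. The names blocks and res are just illustrative.

```r
s <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."
pat <- "(Received|Changed status)(?:: '([^']*)')? @ (../../.... ..:..:.. ....)"

# 1) Extract every substring that matches the whole pattern (one per block)
blocks <- regmatches(s, gregexpr(pat, s, perl = TRUE))[[1]]

# 2) For each capture group i, replace the block by "\\i";
#    sapply binds the three results together as columns
res <- sapply(1:3, function(i) sub(pat, paste0("\\", i), blocks, perl = TRUE))
res
```

An optional group that did not take part in the match (the status of the Received block) comes back as an empty string here.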
Answer 2: (score: 2)
tmp <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."
# Add a colon after "Received" so it has an (empty) status field, then split on two whitespace characters
tmp1 <- strsplit(gsub('Received', 'Received:', tmp), '\\s{2}', perl = TRUE)
do.call(rbind, strsplit(tmp1[[1]], '@ |: '))
#      [,1]             [,2]            [,3]
# [1,] "Received"       ""              "10/10/2014 02:29:55 a.m."
# [2,] "Changed status" "'processing' " "10/10/2014 02:40:20 a.m."
# [3,] "Changed status" "'processed' "  "10/10/2014 02:40:24 a.m."
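The status column above still carries the single quotes and a trailing space. If that matters, one possible cleanup step is sketched below; the variable m is just a name for the matrix returned by the last line, and the input is assumed to have two spaces between blocks.

```r
tmp <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."
tmp1 <- strsplit(gsub('Received', 'Received:', tmp), '\\s{2}', perl = TRUE)
m <- do.call(rbind, strsplit(tmp1[[1]], '@ |: '))

# Strip the leading quote, and the trailing quote plus any whitespace after it
m[, 2] <- gsub("^'|'\\s*$", "", m[, 2])
m
```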
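Answer 3: (score: 1)
I'm assuming you already have your data in a data.frame and that you want to do this across many rows of that data frame. I'll call the data.frame "Data". Here is what I would do, although perhaps someone else can make this more elegant: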
library(stringr)    # str_split, str_sub
library(lubridate)  # mdy_hms
Split <- str_split(Data$String, "@")  # Make a list with your string split by "@"
Data$Received <- NA
Data$Processing <- NA
Data$Processed <- NA
for (i in seq_len(nrow(Data))) {
  Data$Received[i] <- str_sub(Split[[i]][2], 2, 24)  # Extract the date received, etc.
  Data$Processing[i] <- str_sub(Split[[i]][3], 2, 24)
  Data$Processed[i] <- str_sub(Split[[i]][4], 2, 24)
}
Data$Received <- mdy_hms(Data$Received)  # Use lubridate to convert it to POSIX format
Data$Processing <- mdy_hms(Data$Processing)
Data$Processed <- mdy_hms(Data$Processed)
This gives you three columns with the dates and times you need.
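Since the question allows any number of Changed status blocks, a fixed set of three columns may not always fit. Below is a hedged base R sketch of a "long" layout with one row per block. The vector logs stands in for Data$String, its second element is made-up sample data, and the column names are invented for illustration. The timestamps are left as text here rather than converted.

```r
# `logs` stands in for Data$String; the second entry is made-up sample data
logs <- c(
  "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m.",
  "Received @ 11/10/2014 09:00:00 a.m.  Changed status: 'processing' @ 11/10/2014 09:05:10 a.m."
)

# Split on the whitespace before each "Changed status";
# the lookahead keeps the keyword inside the block that follows
blocks <- strsplit(logs, "\\s+(?=Changed status)", perl = TRUE)

long <- data.frame(
  row   = rep(seq_along(blocks), lengths(blocks)),  # index of the input string
  block = unlist(blocks),
  stringsAsFactors = FALSE
)
long$time <- sub("^.* @ ", "", long$block)  # text after the "@"
long
```

Each input string contributes as many rows as it has blocks, so nothing has to be hard-coded about how many status changes there are.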