Regex with positive lookahead still makes strsplit() split the string at the wrong position

Time: 2018-01-26 15:22:52

Tags: r regex pcre regex-lookarounds strsplit

I am trying to split a character vector of messages, each of which is preceded by a date-time indicator.

I was thinking of using strsplit() with a regular expression and perl = TRUE.

Here is some sample data:

TEST <- c("05.10.17, 09:26 - Person One: How about we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")

This is what I have tried so far:

Cut <- unlist(strsplit(TEST,"(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))
Cut

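According to this website, the regular expression should cut the string right before each date-time indicator. However, the result I get looks like this, with the first character of each message split off: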

 [1] "0"                                                                                   
 [2] "5.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
 [3] "0"                                                                                   
 [4] "5.10.17, 09:27 - Person One: I could bring some beer\n"                              
 [5] "0"                                                                                   
 [6] "5.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
 [7] "0"                                                                                   
 [8] "5.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
 [9] "0"                                                                                   
[10] "5.10.17, 09:27 - Person Two: ???\n"
[11] "0"                                                                                   
[12] "5.10.17, 09:28 - Person Two: You guys have history?\n"                               
[13] "0"                                                                                   
[14] "5.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"

The result should look like this:

 [1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                                                                                   
 [2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                         
 [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"                                                                                   
 [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
 [5] "05.10.17, 09:27 - Person Two: ???\n"                                                                                   
 [6] "05.10.17, 09:28 - Person Two: You guys have history?\n"  
 [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"

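Note: I cannot simply split the data at the newline character, because some messages contain one or more newlines in the middle of the message.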

3 answers:

Answer 0: (score: 2)

You only need to create a split point where a \n is followed by a date.

 strsplit(gsub("(.*?\\n)(\\d+[.]\\d+[.]\\d+)","\\1SPLITHERE\\2",TEST),"SPLITHERE")
[[1]]
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                              
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
[5] "05.10.17, 09:27 - Person Two: ???\n"                                                  
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"                               
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
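
The same marker idea can be sketched with the full date-time pattern from the question, so that a newline inside a message is only treated as a boundary when a complete timestamp follows it. This is a sketch under assumptions, not the answer's own code: `split_at_timestamps`, `stamp` and the `<<SPLIT>>` marker are hypothetical names, and the marker is assumed never to occur in the messages themselves.

```r
# Hypothetical helper: split a chat export before each timestamp.
split_at_timestamps <- function(x, marker = "<<SPLIT>>") {
  # Timestamp in the question's format, e.g. "05.10.17, 09:26"
  stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2}"
  # Put the marker between a newline and the timestamp that follows it.
  # The lookahead keeps the timestamp itself untouched.
  marked <- gsub(paste0("\n(?=", stamp, ")"), paste0("\n", marker), x, perl = TRUE)
  # Split on the literal marker; fixed = TRUE avoids any regex interpretation.
  unlist(strsplit(marked, marker, fixed = TRUE))
}

x <- "05.10.17, 09:26 - A: one\ntwo\n05.10.17, 09:27 - B: three\n"
res <- split_at_timestamps(x)
print(res)
```

Because the split itself is done on a fixed string, the zero-length-match behaviour of strsplit() does not come into play here.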

You can also use regmatches from base R:

 regmatches(TEST,gregexpr(".*?\\n",TEST))
[[1]]
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                              
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
[5] "05.10.17, 09:27 - Person Two: ???\n"                                                  
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"                               
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
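
Note that `.*?\\n` returns every line, so a message that contains a newline (the case mentioned in the question's note) would be broken into several pieces. A possible variant, sketched here under the assumption that every message starts with a timestamp in the question's format, matches from one timestamp up to the newline before the next timestamp, or up to the end of the string. `stamp`, `msg_pattern` and the shortened sample `x` are illustrative and not from the original answer.

```r
# Timestamp in the question's format, e.g. "05.10.17, 09:26"
stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2}"

# (?s) lets "." match newlines. The lazy ".*?" stops at a newline that is
# followed by the next timestamp, or at the very end of the string (\z).
msg_pattern <- paste0("(?s)", stamp, ".*?(?:\n(?=", stamp, ")|\\z)")

# Shortened sample with a newline inside the first message
x <- "05.10.17, 09:26 - A: one\ntwo\n05.10.17, 09:27 - B: three\n"
messages <- regmatches(x, gregexpr(msg_pattern, x, perl = TRUE))[[1]]
print(messages)
```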

Answer 1: (score: 1)

You can add the whitespace character class \\s in front of the positive lookahead.

I changed your example slightly so that it matches your problem more closely (i.e. I added a \n inside the first message).

> TEST <- c("05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")
> unlist(strsplit(TEST,"\\s(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))

## [1] "05.10.17, 09:26 - Person One: How about\n we chill on sunday"                         
## [2] "05.10.17, 09:27 - Person One: I could bring some beer"                                
## [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards"    
## [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-"                                
## [5] "05.10.17, 09:27 - Person Two: ???"                                                    
## [6] "05.10.17, 09:28 - Person Two: You guys have history?"                                 
## [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
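
One way to read why this works: the original pattern matches an empty string, and the output in the question is consistent with strsplit() peeling off a single character whenever an empty match sits at the very start of the remaining input. Consuming `\\s` makes the match non-empty, at the cost of dropping the newline that ends each message. If the newline should stay attached, a lookbehind can anchor the split instead. The sketch below assumes that strsplit() restarts its search on the remaining text after each split, so the lookbehind cannot succeed at the start of a piece; `stamp`, `x` and `pieces` are illustrative names.

```r
# Timestamp in the question's format, e.g. "05.10.17, 09:26"
stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2}"

# Shortened sample with a newline inside the first message
x <- "05.10.17, 09:26 - A: one\ntwo\n05.10.17, 09:27 - B: three\n"

# Zero-width split point: preceded by a newline, followed by a timestamp.
# Nothing is consumed, so each piece keeps its trailing "\n".
pieces <- strsplit(x, paste0("(?<=\n)(?=", stamp, ")"), perl = TRUE)[[1]]
print(pieces)
```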

Answer 2: (score: 1)

strsplit(TEST, '(?<=\\\n|^)(0)',perl=T)[[1]][2:7]