Split on a delimiter, keeping repeated delimiters

Time: 2014-10-22 14:19:57

Tags: regex r string stringi

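I am trying to use the stringi package to split on a delimiter (which may be repeated) while keeping the delimiter. This is similar to a question I asked earlier, R split on delimiter (split) keep the delimiter (split), except that here the delimiter can be repeated. I don't think base strsplit can handle this type of regex. The stringi package can, but I can't figure out how to write a regex that splits on the delimiter when it is repeated and that does not leave an empty string at the end of the result.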

Base R, stringr, stringi, or other solutions are all welcome.

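The latter problem (the trailing empty string) occurs because I apply the greedy * to \\s, but a space is not guaranteed to follow the delimiter, so the best I could think of was to leave it in: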

MWE

text.var <- c("I want to split here.But also||Why?",
   "See! Split at end but no empty.",
   "a third string.  It has two sentences"
)

library(stringi)   
stri_split_regex(text.var, "(?<=([?.!|]{1,10}))\\s*")

#Outcome

## [[1]]
## [1] "I want to split here." "But also|"     "|"          "Why?"                 
## [5] ""                     
## 
## [[2]]
## [1] "See!"       "Split at end but no empty." ""                          
## 
## [[3]]
## [1] "a third string."      "It has two sentences"

#Desired Outcome

## [[1]]
## [1] "I want to split here." "But also||"                     "Why?"                                  
## 
## [[2]]
## [1] "See!"         "Split at end but no empty."                         
## 
## [[3]]
## [1] "a third string."      "It has two sentences"

2 answers:

Answer 0 (score: 8)

Using strsplit

 strsplit(text.var, "(?<=[.!|])( +|\\b)", perl=TRUE)
 #[[1]]
 #[1] "I want to split here." "But also||"            "Why?"                 

 #[[2]]
 #[1] "See!"                       "Split at end but no empty."

 #[[3]]
 #[1] "a third string."      "It has two sentences"

 library(stringi)
 stri_split_regex(text.var, "(?<=[.!|])( +|\\b)")
 #[[1]]
 #[1] "I want to split here." "But also||"            "Why?"                 

 #[[2]]
 #[1] "See!"                       "Split at end but no empty."

 #[[3]]
 #[1] "a third string."      "It has two sentences"
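
One detail worth checking before reusing the pattern above: its look-behind class is [.!|], which does not contain ?. That makes no difference for text.var, where the only ? ends the string, but it may matter elsewhere. The sketch below uses a made-up input (not from the original post) and a variant pattern with ? added to the class; the variant is an assumption about what the asker wants, not the answerer's code.

```r
# Illustrative input (not from the original post): a "?" directly followed by a word.
x <- "Why?Because."

# Pattern as posted: "?" is not in the look-behind class.
as_posted <- strsplit(x, "(?<=[.!|])( +|\\b)", perl = TRUE)[[1]]

# Variant with "?" added to the class (our modification, not the answerer's).
with_qmark <- strsplit(x, "(?<=[?.!|])( +|\\b)", perl = TRUE)[[1]]

print(as_posted)
print(with_qmark)
```

With the pattern as posted the string should come back in one piece, while the variant should split after the ?. Note also that the \\b alternative only fires when a word character follows the delimiter, so text such as ".(" would not be split by either version.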

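Answer 1 (score: 6)

Just use a pattern that finds inter-character positions that (1) are preceded by one of ?.!| and (2) are not followed by one of ?.!|. Tack on \\s* to match and eat up any number of consecutive whitespace characters, and you're good to go.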

##                  (look-behind)(look-ahead)(spaces)
strsplit(text.var, "(?<=([?.!|]))(?!([?.!|]))\\s*", perl=TRUE)
# [[1]]
# [1] "I want to split here." "But also||"            "Why?"                 
# 
# [[2]]
# [1] "See!"                       "Split at end but no empty."
# 
# [[3]]
# [1] "a third string."      "It has two sentences"
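
The same pattern can be built from its three parts, which makes the delimiter set easy to change. This is a minimal sketch: the helper name split_keep_delims and its delims argument are hypothetical, and it assumes the delimiters are characters that need no escaping inside a character class (so not ], ^, - or a backslash).

```r
# Hypothetical helper (the name and the `delims` argument are ours, not from the thread).
# Splits after each run of delimiter characters, keeping the run with the text before it.
# Assumes `delims` holds only characters that are literal inside a character class.
split_keep_delims <- function(x, delims = "?.!|") {
  cls <- paste0("[", delims, "]")
  pattern <- paste0(
    "(?<=", cls, ")",  # look-behind: a delimiter sits just before this position
    "(?!", cls, ")",   # negative look-ahead: no delimiter sits just after it
    "\\s*"             # consume any whitespace that follows
  )
  strsplit(x, pattern, perl = TRUE)
}

text.var <- c("I want to split here.But also||Why?",
   "See! Split at end but no empty.",
   "a third string.  It has two sentences"
)

result <- split_keep_delims(text.var)
print(result)
```

The negative look-ahead is what keeps "||" together: the position between the two pipes is preceded by a delimiter but also followed by one, so it is not a split point.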