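I am trying to use the stringi package to split on a delimiter (the delimiter may be repeated) while keeping the delimiter. This is similar to a question I asked earlier, "R split on delimiter (split) keep the delimiter (split)", except that here the delimiter can be repeated. I don't think base strsplit can handle this type of regex. The stringi package can, but I can't figure out how to write the regex so that it splits on the delimiter when there are repeats and also does not leave an empty string at the end of the string.
Base R, stringr, stringi, etc. solutions are all welcome.
The latter problem occurs because I use the greedy * on \\s, but the space isn't guaranteed to be there, so I could only think to leave it in: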
MWE
text.var <- c("I want to split here.But also||Why?",
"See! Split at end but no empty.",
"a third string. It has two sentences"
)
library(stringi)
stri_split_regex(text.var, "(?<=([?.!|]{1,10}))\\s*")
#Outcome
## [[1]]
## [1] "I want to split here." "But also|" "|" "Why?"
## [5] ""
##
## [[2]]
## [1] "See!" "Split at end but no empty." ""
##
## [[3]]
## [1] "a third string." "It has two sentences"
#Desired Outcome
## [[1]]
## [1] "I want to split here." "But also||" "Why?"
##
## [[2]]
## [1] "See!" "Split at end but no empty."
##
## [[3]]
## [1] "a third string." "It has two sentences"
Answer 0 (score: 8)
Using strsplit:
strsplit(text.var, "(?<=[.!|])( +|\\b)", perl=TRUE)
#[[1]]
#[1] "I want to split here." "But also||" "Why?"
#[[2]]
#[1] "See!" "Split at end but no empty."
#[[3]]
#[1] "a third string." "It has two sentences"
Or, with stringi:
library(stringi)
stri_split_regex(text.var, "(?<=[.!|])( +|\\b)")
#[[1]]
#[1] "I want to split here." "But also||" "Why?"
#[[2]]
#[1] "See!" "Split at end but no empty."
#[[3]]
#[1] "a third string." "It has two sentences"
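A hedged side note on the pattern above: its character class is [.!|], without the ? that the question's own pattern includes. That makes no difference for the sample data, where the only ? sits at the very end of a string, but a ? in the middle of a string would not trigger a split. The sketch below uses base R only; the input extra and the variable names are made up for illustration and are not from the original post.

```r
# Hypothetical input (not from the original post) with "?" mid-string.
extra <- "Why? Because."

# Pattern as posted above: "?" is not in the character class,
# so nothing in this string qualifies as a split point.
as_posted <- strsplit(extra, "(?<=[.!|])( +|\\b)", perl = TRUE)[[1]]

# Variant with "?" added to the class, matching the question's delimiter set.
with_qmark <- strsplit(extra, "(?<=[?.!|])( +|\\b)", perl = TRUE)[[1]]

print(as_posted)
print(with_qmark)
```

Note also that the \\b alternative only fires when the next sentence starts with a word character, so text that continues with, say, a quotation mark directly after the delimiter would need a different pattern.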
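Answer 1 (score: 6)
Just use a pattern that finds inter-character positions that (1) are preceded by one of ?.!| and (2) are not followed by one of ?.!|. Tack on \\s* to match and eat up any number of consecutive space characters, and you're good to go.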
## (look-behind)(look-ahead)(spaces)
strsplit(text.var, "(?<=([?.!|]))(?!([?.!|]))\\s*", perl=TRUE)
# [[1]]
# [1] "I want to split here." "But also||" "Why?"
#
# [[2]]
# [1] "See!" "Split at end but no empty."
#
# [[3]]
# [1] "a third string." "It has two sentences"
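To make the three parts of that pattern easier to see, here is a minimal sketch that assembles it from named pieces and compares the result with the desired outcome from the question. The names delims, look_behind, look_ahead, spaces and desired are introduced here for illustration only, and the capture groups of the original pattern are dropped on the assumption that they are not needed for splitting.

```r
# Sample data copied from the question.
text.var <- c("I want to split here.But also||Why?",
              "See! Split at end but no empty.",
              "a third string. It has two sentences")

delims      <- "[?.!|]"                       # the delimiter characters
look_behind <- paste0("(?<=", delims, ")")    # (1) preceded by a delimiter
look_ahead  <- paste0("(?!", delims, ")")     # (2) not followed by a delimiter
spaces      <- "\\s*"                         # then consume any whitespace

pattern <- paste0(look_behind, look_ahead, spaces)
result  <- strsplit(text.var, pattern, perl = TRUE)

# Desired outcome as listed in the question.
desired <- list(
  c("I want to split here.", "But also||", "Why?"),
  c("See!", "Split at end but no empty."),
  c("a third string.", "It has two sentences")
)

# Compare the split result against the desired outcome.
print(identical(result, desired))
```

Keeping the delimiter set in one variable means the look-behind and look-ahead cannot drift apart if another delimiter character is added later.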