Suppose I have a string such as the following.
x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'
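I need to split only on the punctuation marks !?. when they are followed by whitespace, and I want to keep the punctuation.
The following removes the punctuation and leaves leading spaces on the split parts: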
vec <- strsplit(x, '[!?.][:space:]*')
How can I split the string into sentences while keeping the punctuation?
Answer 0 (score: 14)
You can enable PCRE with perl=TRUE and use a lookbehind assertion.
strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)
Regular expression:
(?<! look behind to see if there is not:
[^!?.] any character except: '!', '?', '.'
) end of look-behind
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or more times)
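As a quick check of the pattern explained above, the following sketch applies it to the string from the question and flattens the result with unlist(). The variable name `sentences` is only illustrative.

```r
x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'

# Split on whitespace only when the preceding character is one of ! ? .
# strsplit() returns a list with one element per input string, so unlist() it.
sentences <- unlist(strsplit(x, '(?<![^!?.])\\s+', perl = TRUE))

sentences
# [1] "The world is at end."       "What do you think?"
# [3] "I am going crazy!"          "These people are too calm."
```

The whitespace itself is consumed by the split, so no element starts with a space.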
Answer 1 (score: 5)
The sentSplit function in the qdap package was created for exactly this task:
library(qdap)
sentSplit(data.frame(text = x), "text")
##   tot                       text
## 1 1.1       The world is at end.
## 2 2.2         What do you think?
## 3 3.3          I am going crazy!
## 4 4.4 These people are too calm.
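Answer 2 (score: 2)
See this question. Character classes such as [:space:] are only valid inside a bracket expression, so you need to wrap the class in another pair of brackets. Try: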
vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end" "What do you think"
# [3] "I am going crazy" "These people are too calm"
This gets rid of the leading spaces. To keep the punctuation as well, use a lookbehind with perl = TRUE:
vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end." "What do you think?"
# [3] "I am going crazy!" "These people are too calm."
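One caveat, offered as a sketch rather than as part of the answer above: because `[[:space:]]*` can also match zero characters, the lookbehind pattern splits after every `!`, `?`, or `.`, including a period inside a number such as 3.14. Requiring at least one whitespace character with `+` avoids that. The string `y` below is a made-up example.

```r
# Hypothetical input containing a decimal number
y <- 'Pi is about 3.14. Is that close enough? Yes!'

# '*' allows an empty match, so the text is also cut right after "3."
loose <- unlist(strsplit(y, '(?<=[!?.])[[:space:]]*', perl = TRUE))

# '+' requires whitespace after the punctuation, so "3.14" stays intact
strict <- unlist(strsplit(y, '(?<=[!?.])[[:space:]]+', perl = TRUE))

strict
# [1] "Pi is about 3.14."     "Is that close enough?" "Yes!"
```

For the string in the question both quantifiers give the same four sentences, since every punctuation mark there is followed by a space or by the end of the string.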
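Answer 3 (score: 1)
You can replace the whitespace that follows the punctuation with a placeholder string, for example zzzzz, and then split on that string.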
x <- gsub("([!?.])[[:space:]]*","\\1zzzzz","The world is at end. What do you think? I am going crazy! These people are too calm.")
strsplit(x, "zzzzz")
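The \1 in the replacement string refers to the parenthesized subexpression of the pattern, so the matched punctuation mark is put back before the placeholder.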
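The same idea can be wrapped in a small helper. This is only a sketch: the function name `split_on_marker` and the default marker are illustrative, and the marker is assumed never to occur in the input text.

```r
# Hypothetical helper: insert a marker after sentence-ending punctuation,
# then split on that marker. Assumes `marker` never occurs in `text`.
split_on_marker <- function(text, marker = "zzzzz") {
  # \\1 puts back the punctuation captured by ([!?.]); the marker replaces
  # the whitespace that followed it.
  marked <- gsub("([!?.])[[:space:]]*", paste0("\\1", marker), text)
  # fixed = TRUE treats the marker as a literal string, not as a regex
  unlist(strsplit(marked, marker, fixed = TRUE))
}

split_on_marker('The world is at end. What do you think? I am going crazy!')
# [1] "The world is at end." "What do you think?"   "I am going crazy!"
```

Passing the input as an argument also avoids overwriting x, which the two-line version above does.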
Answer 4 (score: 1)
As of qdap version 1.1.0, you can use the sent_detect function, as follows:
library(qdap)
sent_detect(x)
## [1] "The world is at end." "What do you think?"
## [3] "I am going crazy!" "These people are too calm."