text="stack overflow... is a popular website."
I want to separate punctuation marks from words. The output should be:
"stack overflow ... is a popular website . "
Of course, the command gsub("\\.", " \\. ", text, fixed = FALSE)
returns:
"stack overflow . . . is a popular website . "
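because it does not distinguish a period from an ellipsis (suspension points). In short, when three periods appear together in the text, R should treat them as a single punctuation mark.
Answer 0: (score: 3)
I think a non-lookaround approach would be more efficient and readable: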
text="stack overflow... is a popular website."
gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", text)
## => [1] "stack overflow ... is a popular website . "
See the IDEONE demo.
I updated the post because spaces are needed both before and after the punctuation.
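The [[:space:]]* on either side of (\\.+) matches zero or more whitespace characters, and (\\.+) matches one or more periods. The (...) forms a capturing group whose value is stored in numbered buffer #1, which we can access with the \1 backreference in the replacement pattern (written "\\1" in an R string). So \1 is replaced with the periods captured by the pattern. Capturing is more efficient than using lookarounds because there is no overhead of checking the text before/after the current position.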
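As a minimal sketch of the difference described above, the snippet below compares a per-character replacement with the capturing-group version. The variable names naive and grouped are only illustrative and do not come from the original answer.

```r
text <- "stack overflow... is a popular website."

# Per-character replacement: every single period is padded on its own,
# so the ellipsis is broken into three separate periods.
naive <- gsub(".", " . ", text, fixed = TRUE)

# Capturing the whole run with (\\.+) keeps "..." together; the optional
# whitespace around it is consumed and \\1 re-inserts the captured run.
grouped <- gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", text)

print(naive)
print(grouped)  # "stack overflow ... is a popular website . "
```

Because the surrounding whitespace is part of the match, the replacement always leaves exactly one space on each side of the run, whether or not a space was there before.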
Now, if you need to handle all punctuation marks, use [[:punct:]]:
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
See the R regex help:
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
text="Hi!stack overflow... is a popular website, I visit it every day."
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
## => [1] "Hi ! stack overflow ... is a popular website , I visit it every day . "
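To make the scope of [[:punct:]] concrete, here is a small sketch that checks a few single characters against the class and shows one side effect on a hyphenated word. The sample characters and the word "well-known" are arbitrary choices for illustration.

```r
# Which of these single characters does the POSIX class [[:punct:]] match?
chars <- c(".", ",", "!", "-", "_", "a", "1", " ")
is_punct <- grepl("^[[:punct:]]$", chars)
print(setNames(is_punct, chars))

# Because "-" belongs to the class, a hyphenated word is split as well.
split_hyphen <- gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", "well-known")
print(split_hyphen)  # "well - known"
```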
To avoid matching hyphenated words, you can match and skip a - that is enclosed by word boundaries:
text="Hi!stack-overflow... is a popular website, I visit it every day."
gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl=TRUE)
## => [1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
See the demo.
Answer 1: (score: 3)
Following the discussion in the comments, this regex will most likely meet your needs:
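The effect of the (*SKIP)(*F) alternative can be seen by running the same replacement with and without it. This is only a sketch; the names pattern_plain and pattern_skip are introduced here for illustration, and it assumes R's PCRE build supports Unicode properties such as \p{P}.

```r
text <- "Hi!stack-overflow... is a popular website, I visit it every day."

# Without the skip branch, the hyphen is punctuation like any other.
pattern_plain <- "\\s*(\\p{P}+)\\s*"
# With it, a "-" between two word characters is matched, then discarded
# by (*SKIP)(*F), so the second alternative never gets to see it.
pattern_skip <- "\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*"

plain <- gsub(pattern_plain, " \\1 ", text, perl = TRUE)
skipped <- gsub(pattern_skip, " \\1 ", text, perl = TRUE)

print(plain)    # "Hi ! stack - overflow ... is a popular website , I visit it every day . "
print(skipped)  # "Hi ! stack-overflow ... is a popular website , I visit it every day . "
```

The backtracking verbs are a PCRE feature, which is why perl = TRUE is required for this pattern.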
(?:\b| )([.,:;!]+)(?: |\b)
To use it in R, the backslashes must be doubled.
So we end up with:
text<-c('Hi!stack-overflow... is a popular website, I visit it every day.',
'aaa...',
'AAA...B"B"B',
'AA .BBB #unlikely to happen but managed anyway')
> gsub('(?:\\b| )([.,:;!]+)(?: |\\b)',' \\1 ',text)
[1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
[2] "aaa ... "
[3] "AAA ... B\"B\"B"
[4] "AA . BBB #unlikely to happen but managed anyway"
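The hyphen and the double quotes survive in the results above because the character class [.,:;!] lists only the marks to be separated. As a rough sketch, the runs that this class picks up can be listed with regmatches and gregexpr. The variable runs is just an illustrative name.

```r
text <- c("Hi!stack-overflow... is a popular website, I visit it every day.",
          'AAA...B"B"B')

# Extract every maximal run of the listed punctuation marks per string.
runs <- regmatches(text, gregexpr("[.,:;!]+", text))
print(runs)
# First string:  "!" "..." "," "."
# Second string: "..."
```

Extending the class (for example with ? or parentheses) would be the way to cover more marks, at the cost of deciding each one explicitly.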
Answer 2: (score: 2)
Try
gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", text, perl=TRUE)
#[1] "stack overflow ... is a popular website . "
gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", "aaa...", perl=TRUE)
#[1] "aaa ... "
gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", "aaa...bbb", perl=TRUE)
#[1] "aaa ... bbb"
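The lookaround patterns above match empty positions rather than characters, so gsub only inserts a space and never consumes text. A minimal sketch that wraps the more general third pattern in a helper follows; pad_periods is a hypothetical name, not part of the original answer.

```r
# Hypothetical helper: insert a space at the zero-width positions
# (a) between a word character and a following period, and
# (b) after a period that is followed by a word character or the end.
pad_periods <- function(x) {
  gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", x, perl = TRUE)
}

print(pad_periods("stack overflow... is a popular website."))
print(pad_periods("aaa..."))     # "aaa ... "
print(pad_periods("aaa...bbb"))  # "aaa ... bbb"
```

Unlike the capturing-group approach, this leaves any whitespace that is already present untouched, so it adds missing spaces but does not normalize existing ones.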