Regex with gsub in R: distinguishing an ellipsis from a period

Date: 2016-01-13 09:34:51

Tags: regex r gsub

text="stack overflow... is a popular website."

I want to separate punctuation marks from words. The output should be:

"stack overflow ... is a popular website . "

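Of course, the command gsub("\\.", " \\. ", text, fixed = FALSE) returns:

"stack overflow . . . is a popular website . "

because it does not distinguish a period from an ellipsis (suspension points). In short, when three periods appear together in the text, R should treat them as a single punctuation mark.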

3 answers:

Answer 0 (score: 3)

I think a non-lookaround approach is more efficient and more readable:

text="stack overflow... is a popular website."
gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", text)
## => [1] "stack overflow ... is a popular website . "

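See the IDEONE demo.

I updated the post because spaces are required both before and after the punctuation.

The [[:space:]]* on either side of (\\.+) matches zero or more whitespace characters, and (\\.+) matches one or more periods. The parentheses (...) form a capturing group whose value is stored in numbered buffer #1, which the replacement pattern can access through the \1 backreference. So \1 is replaced with the periods the pattern captured. Capturing is more efficient than lookarounds because there is no overhead of checking the text before or after the current position.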
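To make the role of the capturing group concrete, here is a minimal sketch on a made-up input. The variable names and the sample string are assumptions for illustration and are not from the original post. It shows that \\1 preserves however many periods were matched, and that the surrounding whitespace is normalized to single spaces.

```r
# Hypothetical sample input (not from the original post): runs of two, three,
# and one period, with a double space after the ellipsis.
sample_text <- "Wait.. what...  fine."

# [[:space:]]* absorbs any whitespace on either side of a run of periods,
# (\\.+) captures the run itself, and \\1 writes it back between single spaces.
spaced <- gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", sample_text)
print(spaced)
# [1] "Wait .. what ... fine . "

# The final period leaves a trailing space; trimws() removes it if unwanted.
trimmed <- trimws(spaced)
print(trimmed)
# [1] "Wait .. what ... fine ."
```

Whether the trailing space should stay depends on the use case. The question's expected output keeps it, so trimming is optional.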

Now, if you need to handle all punctuation, use [[:punct:]]:

gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)

See the R regex help:

  [:punct:]
  Punctuation characters:
  ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

Code demo

text="Hi!stack overflow... is a popular website, I visit it every day."
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
## => [1] "Hi ! stack overflow ... is a popular website , I visit it every day . "

Update for hyphenated words

To avoid matching hyphenated words, you can match and skip any - that is enclosed by word boundaries:

text="Hi!stack-overflow... is a popular website, I visit it every day."
gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl=TRUE)
## => [1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "

See the demo.
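The effect of the skipped branch is easiest to see side by side. The sketch below runs the punctuation pattern with and without the \\b-\\b(*SKIP)(*F) alternative on a made-up string. The sample input and variable names are assumptions for illustration.

```r
# Hypothetical sample input (not from the original post).
sample_text <- "state-of-the-art... really."

# Without the skip branch, every hyphen is treated like any other punctuation.
no_skip <- gsub("\\s*(\\p{P}+)\\s*", " \\1 ", sample_text, perl = TRUE)
print(no_skip)
# [1] "state - of - the - art ... really . "

# With the skip branch, a hyphen between two word characters is matched first,
# then (*SKIP)(*F) discards that match and resumes scanning after the hyphen,
# so the second alternative never gets a chance to pad it with spaces.
with_skip <- gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ",
                  sample_text, perl = TRUE)
print(with_skip)
# [1] "state-of-the-art ... really . "
```

Note that (*SKIP) and (*F) are PCRE backtracking verbs, so this form only works with perl = TRUE.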

Answer 1 (score: 3)

After this batch of comments, this regex is the one most likely to fit your needs:

(?:\b| )([.,:;!]+)(?: |\b)

Demo

To use it in R, the backslashes must be doubled.

So we end up with:

text <- c('Hi!stack-overflow... is a popular website, I visit it every day.',
          'aaa...',
          'AAA...B"B"B',
          'AA .BBB #unlikely to happen but managed anyway')

> gsub('(?:\\b| )([.,:;!]+)(?: |\\b)',' \\1 ',text)
[1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
[2] "aaa ... "
[3] "AAA ... B\"B\"B"
[4] "AA . BBB #unlikely to happen but managed anyway"
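The backslash doubling mentioned above comes from R's string literal syntax rather than from the regex itself. As a small sketch (the variable names are assumptions for illustration), writeLines() shows the pattern as the regex engine actually receives it:

```r
# In R source, "\\b" is a two-character string: a backslash followed by "b".
# A single "\b" would instead be the backspace control character.
word_boundary <- "\\b"
print(nchar(word_boundary))
# [1] 2

# writeLines() prints the raw characters, i.e. the regex with single backslashes.
pattern <- "(?:\\b| )([.,:;!]+)(?: |\\b)"
writeLines(pattern)
# (?:\b| )([.,:;!]+)(?: |\b)
```

Printing a pattern this way is a quick check that the escaping in the source code matches the regex that was tested elsewhere.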

Answer 2 (score: 2)

Try

gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", text, perl=TRUE)
#[1] "stack overflow ... is a popular website . "

gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", "aaa...", perl=TRUE)
#[1] "aaa ... "

gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", "aaa...bbb", perl=TRUE)
#[1] "aaa ... bbb"