R regex: replace all punctuation except sentence markers, apostrophes and hyphens

Date: 2015-08-06 17:44:01

Tags: regex r

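I am looking for a way to mark the beginning and end of sentences in R. To do this, I would like to remove all punctuation except end-of-sentence markers such as full stops, exclamation marks, question marks and hyphens, which I would like to replace with the marker ***. At the same time, I would also like to preserve words that contain apostrophes. To give a concrete example, given this string: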

txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"

the desired result would be

txt <- "We have examined all the possibilities however we have not reached a solid conclusion *** however we keep and open mind*** Have you considered any other approach*** Haven't you***"

I have not been able to express this with a regular expression. Any hints are much appreciated.

2 answers:

Answer 0 (score: 2)

You can use gsub.

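"I would like to remove all punctuation except end-of-sentence markers such as full stops, exclamation marks, question marks and hyphens."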

> txt <- "We have examined all the possibilities, however he have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"
> gsub("[-.?!]", "<S>", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=T))
[1] "We have examined all the possibilities however he have not reached a solid conclusion <S> however we keep and open mind<S> Have you considered any other approach<S> Haven't you<S>"
> gsub("[-.?!]", "***", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=T))
[1] "We have examined all the possibilities however he have not reached a solid conclusion *** however we keep and open mind*** Have you considered any other approach*** Haven't you***"
  

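"which I would like to replace with the marker ***. At the same time, I would also like to preserve words that contain apostrophes."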

gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=T)
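The pattern `(?![-.?!'])[[:punct:]]` reads as "any punctuation character, unless it is one of `-`, `.`, `?`, `!` or `'`". The `(?!...)` part is a negative lookahead, which is why `perl=T` is needed. As a minimal sketch, the two gsub calls can be wrapped in a small helper; the function name `mark_sentence_ends` and its `marker` argument are hypothetical and are not part of the answer.

```r
# Hypothetical helper: the name and the `marker` argument are illustrative only.
mark_sentence_ends <- function(x, marker = "***") {
  # Step 1: drop every punctuation character except - . ? ! and '.
  # The negative lookahead (?!...) requires perl = TRUE.
  kept <- gsub("(?![-.?!'])[[:punct:]]", "", x, perl = TRUE)
  # Step 2: replace the remaining sentence markers with the marker.
  # Assumes `marker` contains no backslashes, which gsub would interpret.
  gsub("[-.?!]", marker, kept)
}

txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"
result <- mark_sentence_ends(txt)
print(result)
```

Because the hyphen is treated as a sentence marker here, a hyphen inside a compound word would be replaced by the marker as well.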

Answer 1 (score: 0)

You can do this with two regular expressions. First, use a character class to remove the characters you don't want:

[,;:]
  ^--- Whatever you want to remove, put it here

and use an empty replacement string.

Then you can use a second regex like this:

[.?!-]
  ^--- Add characters you want to replace here

with the replacement string:

<S>

Working demo
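In R, the two-pass approach might look like the sketch below. The exact character classes are assumptions chosen for illustration: the first class lists a few characters to delete, and the full stop is placed in the second class so that it is replaced rather than removed, as the question asks. The variable names `step1` and `step2` are likewise illustrative.

```r
txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"

# Pass 1: delete the unwanted characters (illustrative list: comma, semicolon, colon).
step1 <- gsub("[,;:]", "", txt)

# Pass 2: replace the sentence markers with <S>.
# The hyphen is placed last in the class so it is read literally, not as a range.
step2 <- gsub("[.?!-]", "<S>", step1)

print(step2)
```

Unlike the lookahead version, this relies on listing every unwanted character explicitly, so any punctuation missing from the first class is left in the text.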