Remove punctuation except apostrophes and intra-word dashes in R

Time: 2014-07-03 09:59:43

Tags: string r text

I know how to remove punctuation while keeping apostrophes:

gsub( "[^[:alnum:]']", " ", db$text )  

or how to keep intra-word dashes using the tm package:

removePunctuation(db$text, preserve_intra_word_dashes = TRUE)

But I cannot find a way to do both at the same time. For example, if my original sentence is:

"Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

I would like it to become:

"Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

Of course, there will be extra whitespace, but I can remove that later.

I would greatly appreciate your help.

2 Answers:

Answer 0: (score: 9)

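Use character classes. A negated class keeps alphanumerics, apostrophes, and dashes and replaces everything else with a space: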

gsub("[^[:alnum:]'-]", " ", db$text)

## "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
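
As a minimal sketch of the full round trip, the snippet below applies a negated character class to the sentence from the question and then collapses the leftover runs of spaces. The variable names `text`, `spaced`, and `cleaned` are only illustrative, and the whitespace cleanup step is an assumption about what "remove them later" would look like in base R.

```r
# The sentence from the question.
text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

# Negated character class: replace everything that is not a letter,
# digit, apostrophe, or dash with a single space.
spaced <- gsub("[^[:alnum:]'-]", " ", text)

# Collapse the runs of spaces left behind and trim both ends.
cleaned <- trimws(gsub("[[:space:]]+", " ", spaced))

print(cleaned)
```

Note that this keeps every dash, including a standalone one surrounded by spaces, so it matches the requested output only when dashes occur inside words.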

Answer 1: (score: 3)

I like David Arenberg's answer. If you need another way, you could try:

library(qdap)

text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

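# strip() deletes characters outright, so "/" is kept here and then turned into a space
gsub("/", " ", strip(text, char.keep = c("-", "/"), apostrophe.remove = FALSE, lower.case = FALSE))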
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

library(gsubfn)
clean(gsubfn("[[:punct:]]", function(x) ifelse(x == "'", "'", ifelse(x == "-", "-", " ")), text))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

clean comes from qdap. It removes escaped characters and extra whitespace.
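
If neither package is available, something similar can be sketched in base R. The approaches above keep every dash, so the version below adds a second pass that drops dashes without an alphanumeric character on both sides, which is closer to "intra-word" in the strict sense. The function name `keep_intra_word_dashes` is hypothetical, and the lookaround pattern is one possible choice rather than what tm does internally.

```r
# Hypothetical helper: keep apostrophes and only those dashes that sit
# between two alphanumeric characters.
keep_intra_word_dashes <- function(x) {
  # Pass 1: replace everything except letters, digits, apostrophes,
  # and dashes with a space.
  x <- gsub("[^[:alnum:]'-]", " ", x)
  # Pass 2: replace any dash that is not preceded by an alphanumeric
  # or not followed by one (PCRE lookarounds need perl = TRUE).
  x <- gsub("(?<![[:alnum:]])-|-(?![[:alnum:]])", " ", x, perl = TRUE)
  # Pass 3: collapse repeated whitespace and trim the ends.
  trimws(gsub("[[:space:]]+", " ", x))
}

print(keep_intra_word_dashes("e-board - a dash -- and trailing- word"))
```

The first pass is the same negated character class idea as in the other answer; only the second pass is new, and it can be omitted if standalone dashes should survive.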