I know how to remove punctuation while keeping apostrophes:
gsub( "[^[:alnum:]']", " ", db$text )
or how to keep intra-word dashes using the tm package:
removePunctuation(db$text, preserve_intra_word_dashes = TRUE)
But I can't find a way to do both at once. For example, if my original sentence is:
"Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
I would like it to become:
"Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
There will be extra whitespace, of course, but I can remove it later.
I would be very grateful for your help.
Answer 0 (score: 9)
gsub("[^[:alnum:]'-]", " ", db$text)
## "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
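Note that a character class like the one above keeps every hyphen, including free-standing ones such as " - ", and leaves runs of spaces behind. A minimal base R sketch that also drops hyphens not sitting between two alphanumeric characters and then squeezes the whitespace could look like the following. The helper name clean_text is hypothetical, and the lookaround step assumes mostly ASCII text, since [[:alnum:]] under perl = TRUE may not cover accented letters.

```r
# Hypothetical helper (not from tm or qdap); uses base R only.
clean_text <- function(x) {
  # Replace everything except letters, digits, apostrophes and hyphens with a space.
  x <- gsub("[^[:alnum:]'-]", " ", x)
  # Replace hyphens that are not surrounded by alphanumerics on both sides.
  x <- gsub("(?<![[:alnum:]])-|-(?![[:alnum:]])", " ", x, perl = TRUE)
  # Collapse runs of whitespace into one space and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
clean_text(text)
clean_text("a - b -- c-d")
```

For the sample sentence this should give the desired string from the question with single spaces, and the second call should keep only the hyphen in "c-d".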
Answer 1 (score: 3)
I like David Arenberg's answer. If you need another approach, you could try:
library(qdap)
text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
gsub("/", " ", strip(text, char.keep = c("-", "/"), apostrophe.remove = FALSE, lower.case = FALSE))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
or
library(gsubfn)
clean(gsubfn("[[:punct:]]", function(x) ifelse(x=="'","'",ifelse(x=="-","-"," ")),text))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
clean comes from qdap; it removes escaped characters and extra whitespace.