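Remove punctuation except smileys - R, tm package

Time: 2016-02-15 17:53:44

Tags: r text-mining tm punctuation

I am using the tm package in R. I want to remove all punctuation from this text except the smileys.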

data <- c("conflict need resolved :<. turned conversation exchange ideas richer environment one tricky concepts :D , “conflict” always top business agendas :>. maybe different ideas/opinions different :) ")

I tried:

library(tm)
data <- gsub("[^a-z]", " ", data, ignore.case = TRUE)

This removes all punctuation, including the smileys, and gives this output:

data <- conflict need resolved turned conversation exchange ideas richer environment one tricky concepts conflict always top business agendas maybe different ideas opinions different

whereas what I need is:

data <- conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :) 

Any suggestions, please.

1 answer:

Answer 0 (score: 3)

I would write a dictionary of smileys, replace them with text, remove the punctuation, and then swap the smileys back in.

# Build the dictionary. Make sure the placeholder strings do not already occur in the text,
# which can be tested with something like any(stri_detect_fixed(data, smiles$r)).
# No placeholder may be contained in another one ("happyParen" sits inside "unhappyParen"),
# or the restore step would match inside the longer one; hence "sad..." instead of "unhappy...".
smiles <- data.frame(s = c(":<", ":>", ":)", ":(", ";)", ":D"),
                     r = c("sadBracket", "happyBracket", "happyParen", "sadParen", "winkSmiley", "DSmiley"),
                     stringsAsFactors = FALSE)

library(stringi)
## replace each smiley with its text placeholder
data <- stri_replace_all_fixed(data, pattern = smiles$s, replacement = smiles$r, vectorize_all = FALSE)
## remove punctuation (everything that is not a letter becomes a space)
data <- gsub("[^a-z]", " ", data, ignore.case = TRUE)
## replace each text placeholder with its punctuation smiley
data <- stri_replace_all_fixed(data, pattern = smiles$r, replacement = smiles$s, vectorize_all = FALSE)
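
The comment at the top of the code asks you to make sure the placeholder strings are safe to use. A minimal base-R sketch of two such checks follows. The function names placeholder_in_text and placeholders_overlap are hypothetical, invented here for illustration, and are not part of tm or stringi.

```r
# Hypothetical helpers (not from tm or stringi), base R only.

# TRUE if any placeholder already occurs somewhere in the text,
# in which case the restore step would turn ordinary text into a smiley.
placeholder_in_text <- function(text, placeholders) {
  any(vapply(placeholders,
             function(p) any(grepl(p, text, fixed = TRUE)),
             logical(1)))
}

# TRUE if any placeholder is a substring of another placeholder,
# in which case a sequential restore could match inside the longer one.
placeholders_overlap <- function(placeholders) {
  for (i in seq_along(placeholders)) {
    others <- placeholders[-i]
    if (any(grepl(placeholders[i], others, fixed = TRUE))) return(TRUE)
  }
  FALSE
}

text <- "tricky concepts :D , maybe different ideas/opinions different :)"
placeholder_in_text(text, c("happyParen", "DSmiley"))
placeholders_overlap(c("happyParen", "unhappyParen"))
```

If either check returns TRUE, pick different placeholder strings before running the replace, strip, restore sequence.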

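Note that if the smileys matter for your analysis, you should leave them as words, since they are easier to work with in that form. You may also want to look at tm::removePunctuation() and tm::tm_map() to handle the punctuation-removal step.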
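
As a rough sketch of how the same idea might look inside a tm pipeline, assuming the text is held in a corpus: the helper swap_fixed and the placeholder names are hypothetical, chosen for illustration, while VCorpus, VectorSource, tm_map, content_transformer, removePunctuation and stripWhitespace are tm functions. Note that removePunctuation deletes punctuation rather than replacing it with a space, so "ideas/opinions" becomes "ideasopinions", unlike the gsub approach.

```r
library(tm)

smileys      <- c(":<", ":>", ":)", ":(", ";)", ":D")
# Hypothetical placeholders; none is a substring of another.
placeholders <- c("sadBracket", "happyBracket", "happyParen",
                  "sadParen", "winkSmiley", "DSmiley")

# Hypothetical helper: replace each from[i] with to[i] as a literal string.
swap_fixed <- function(x, from, to) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

corpus <- VCorpus(VectorSource(
  "tricky concepts :D , maybe different ideas/opinions different :)"))

# 1. protect the smileys, 2. strip punctuation, 3. restore the smileys
corpus <- tm_map(corpus, content_transformer(swap_fixed),
                 from = smileys, to = placeholders)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(swap_fixed),
                 from = placeholders, to = smileys)
# collapse the runs of spaces left behind
corpus <- tm_map(corpus, stripWhitespace)

result <- content(corpus[[1]])
result
```

If the smileys are to stay as words for the analysis, as suggested above, the third step can simply be left out.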