Question

I am using the tm package in R. I want to remove all punctuation from this text except the smileys.

data <- c("conflict need resolved :<. turned conversation exchange ideas richer environment one tricky concepts :D , “conflict” always top business agendas :>. maybe different ideas/opinions different :) " )

I tried

library(tm)
data <- gsub("[^a-z]", " ", data, ignore.case = TRUE)

which removes all punctuation, including the smileys, and gives this output:

data <- conflict need resolved turned conversation exchange ideas richer environment one tricky concepts conflict always top business agendas maybe different ideas opinions different

whereas what I need is:

data <- conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :)

Any suggestions, please.

Answer 1

I would write a dictionary of smileys, replace them with words, remove the punctuation, and then swap the words back for the smileys.

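# Make the dictionary. You need to make sure the placeholder strings are not already in the text, which can be tested with something like stri_match(str=data,regex = smiles$r)
# ":(" is listed before ":)" so that "unhappyParen" is restored before its substring "happyParen" in the last step.
smiles <- data.frame(s=c(":<",":>",":(",":)",";)",":D"),
                     r=c("unhappyBracket","happyBracket","unhappyParen","happyParen","winkSmiley","DSmiley"),
                     stringsAsFactors = FALSE)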

library(stringi)
## replace smiley with text
data <- stri_replace_all_fixed(data,pattern = smiles$s,replacement = smiles$r,vectorize_all = FALSE)
## remove punctuation
data <- gsub("[^a-z]", " ", data, ignore.case = TRUE)
## replace text-smiley with punctuation smiley
data <- stri_replace_all_fixed(data,pattern = smiles$r,replacement = smiles$s,vectorize_all = FALSE)
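The three steps can also be wrapped in a single function using only base R. This is a minimal sketch, not part of the original answer: the function name `strip_punct_keep_smileys` and its arguments are hypothetical. It adds two safeguards as assumptions of mine: it stops if a placeholder already occurs in the text, and it restores longer placeholders first, because "happyParen" is a substring of "unhappyParen".

```r
# Hypothetical helper (not from tm or stringi): strips every non-letter
# character from x while keeping the listed smileys.
strip_punct_keep_smileys <- function(x, smileys, placeholders) {
  stopifnot(length(smileys) == length(placeholders))

  # Guard: a placeholder that already occurs in the text would later be
  # turned into a smiley by mistake.
  for (p in placeholders) {
    if (any(grepl(p, x, fixed = TRUE))) {
      stop("placeholder already present in text: ", p)
    }
  }

  # Step 1: smiley -> placeholder word, padded with spaces so it stays
  # a separate token.
  for (i in seq_along(smileys)) {
    x <- gsub(smileys[i], paste0(" ", placeholders[i], " "), x, fixed = TRUE)
  }

  # Step 2: replace everything that is not a letter with a space.
  x <- gsub("[^a-z]", " ", x, ignore.case = TRUE)

  # Step 3: placeholder -> smiley, longest placeholder first so that
  # "unhappyParen" is restored before its substring "happyParen".
  for (i in order(nchar(placeholders), decreasing = TRUE)) {
    x <- gsub(placeholders[i], smileys[i], x, fixed = TRUE)
  }

  # Collapse runs of spaces and trim the ends.
  trimws(gsub(" +", " ", x))
}

smileys      <- c(":<", ":>", ":)", ":(", ";)", ":D")
placeholders <- c("unhappyBracket", "happyBracket", "happyParen",
                  "unhappyParen", "winkSmiley", "DSmiley")

cleaned <- strip_punct_keep_smileys(
  "resolved :<. concepts :D , agendas :>. ideas/opinions different :) ",
  smileys, placeholders
)
print(cleaned)
```

Unlike the code above, this version collapses the extra spaces left behind by the removed punctuation, which matches the desired output in the question more closely.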

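Note that if the smileys matter for your analysis, you should leave them as words, since they are easier to work with in that form. You may also want to look at tm::removePunctuation() and tm::tm_map to handle the punctuation-removal step.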
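To illustrate why the word form is easier to work with, here is a small base R sketch that stops after the punctuation-removal step and counts the smileys as ordinary tokens. The sample text and the variable names (`smiley_words`, `smiley_counts`) are made up for illustration.

```r
# Illustrative text (shortened and adapted from the question's data).
text <- "conflict need resolved :<. tricky concepts :D , agendas :>. different :) and again :)"

smileys      <- c(":<", ":>", ":)", ":D")
smiley_words <- c("unhappyBracket", "happyBracket", "happyParen", "DSmiley")

# Replace each smiley with its word, padded so it becomes its own token.
for (i in seq_along(smileys)) {
  text <- gsub(smileys[i], paste0(" ", smiley_words[i], " "), text, fixed = TRUE)
}

# Remove punctuation; the smiley words survive because they are letters only.
text <- gsub("[^a-z]", " ", text, ignore.case = TRUE)

# Split on whitespace and count only the smiley tokens.
tokens <- strsplit(trimws(text), " +")[[1]]
smiley_counts <- table(factor(tokens[tokens %in% smiley_words],
                              levels = smiley_words))
print(smiley_counts)
```

Once the smileys are plain words they pass through tokenizers and term counts like any other term, which is the convenience the note above refers to.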

Remove punctuation except smileys - R, tm package
