I am using the tm package in R. I want to remove all punctuation from this text except the smileys.
data <- c("conflict need resolved :<. turned conversation exchange ideas richer environment one tricky concepts :D , “conflict” always top business agendas :>. maybe different ideas/opinions different :) ")
I tried:
library(tm)
data <- gsub("[^a-z]", " ", data, ignore.case = TRUE)
This removes all punctuation, including the smileys, and gives the output:
data <- conflict need resolved turned conversation exchange ideas richer environment one tricky concepts conflict always top business agendas maybe different ideas opinions different
whereas what I need is:
data <- conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :)
Any suggestions, please?
Answer 0 (score: 3)
I would write a dictionary of smileys, replace them with words, remove the punctuation, and then swap the words back to smileys.
# Make the dictionary. You need to make sure the strings are not in the text, which can be tested with something like stri_match(str=data,regex = smiles$r)
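# List ":(" before ":)": the replacements run in order, and "happyParen" is a
# substring of "unhappyParen", so the longer placeholder must be restored first.
smiles <- data.frame(s=c(":<",":>",":(",":)",";)",":D"),
r=c("unhappyBracket","happyBracket","unhappyParen","happyParen","winkSmiley","DSmiley"),
stringsAsFactors = FALSE)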
library(stringi)
## replace smiley with text
data <- stri_replace_all_fixed(data,pattern = smiles$s,replacement = smiles$r,vectorize_all = FALSE)
## remove punctuation
data <- gsub("[^a-z]", " ", data, ignore.case = TRUE)
## replace text-smiley with punctuation smiley
data <- stri_replace_all_fixed(data,pattern = smiles$r,replacement = smiles$s,vectorize_all = FALSE)
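Note that if the smileys matter for your analysis, you should leave them as words, since they are easier to manipulate that way. You may also want to look at tm::removePunctuation() and tm::tm_map to handle the punctuation-removal step.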
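For reuse, the three steps above could be wrapped in a small function. The following is only a sketch in base R (no stringi or tm needed): the function name `strip_punct_keep_smileys` and the placeholder names are made up for illustration, and unlike the `gsub` call above it collapses each run of non-letters into a single space and trims the ends.

```r
# Hypothetical helper: drop everything except letters, but keep known smileys.
# The smiley list and the placeholder names are illustrative assumptions.
strip_punct_keep_smileys <- function(text, smileys, placeholders) {
  stopifnot(length(smileys) == length(placeholders))
  # Placeholders must be letters only, otherwise the cleanup step damages them.
  stopifnot(all(grepl("^[A-Za-z]+$", placeholders)))
  # Placeholders must not already occur in the text.
  for (p in placeholders) {
    if (any(grepl(p, text, fixed = TRUE))) {
      stop("placeholder already present in text: ", p)
    }
  }
  # Step 1: smiley -> placeholder word
  for (i in seq_along(smileys)) {
    text <- gsub(smileys[i], placeholders[i], text, fixed = TRUE)
  }
  # Step 2: replace every run of non-letters with one space
  text <- gsub("[^A-Za-z]+", " ", text)
  # Step 3: placeholder word -> smiley
  for (i in seq_along(smileys)) {
    text <- gsub(placeholders[i], smileys[i], text, fixed = TRUE)
  }
  trimws(text)
}

smileys <- c(":<", ":>", ":(", ":)", ";)", ":D")
# Same-length, distinct names, so no placeholder is a substring of another.
placeholders <- c("smileyQA", "smileyQB", "smileyQC",
                  "smileyQD", "smileyQE", "smileyQF")

x <- "resolved :<. tricky concepts :D , ideas/opinions different :) "
out <- strip_punct_keep_smileys(x, smileys, placeholders)
print(out)
# [1] "resolved :< tricky concepts :D ideas opinions different :)"
```

Because the placeholders all have the same length and differ in their last letter, the order in which they are restored does not matter here, which avoids the substring pitfall of names like "happyParen" and "unhappyParen".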