Remove all characters except certain punctuation marks to generate word frequencies?

Asked: 2018-07-26 13:25:48

Tags: r gsub tm punctuation

I want to remove all punctuation from a character vector except these four specific characters: +, ., -, /

I know there are similar questions, but I tried their solutions and did not get the result I want.

The character vector item currently contains many round and square brackets that I would like to get rid of.

Here is a sample of the item variable:

item
BOYS S SLV MOCK LAYER TEE
BOYS S SLV PRINTED TEE
CHEAP MONDAY TEE (SAD TOP)
LOPPAN S SLV TEE (STRIPE)
FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE
LST-[REVISED]

Ultimately, I want to generate the frequency of each unique word in the item variable:

word          freq
boys          2
s             3
slv           4
tee           4
tee-zebralogo 1
mock          1
layer         1
printed       2
cheap         1
...           ...

This is my current code using the tm package:

library(tm)

item_names <- df1$item
item_names <- tolower(item_names)
item_names <- removePunctuation(item_names)
myCorpus <- Corpus(VectorSource(item_names))
myTDM <- TermDocumentMatrix(myCorpus)
findFreqTerms(myTDM)

m <- as.matrix(myTDM)
v <- sort(rowSums(m),decreasing=TRUE)
df4 <- data.frame(word = names(v),freq=v)

The code above removes all punctuation, but I want to keep the four characters listed above, and I have not managed to do that satisfactorily.

I have also tried base R functions:

item_names <- df1$item
item_names <- tolower(item_names)
item_names <- gsub(pattern = "[^[:alnum:][:space:][-\\.\\+\\/]]", "", 
item_names)
item_names <- gsub(pattern = "\\s+", " ", item_names)

table(do.call(c, lapply(item_names, function(x) unlist(strsplit(x, " ")))))
df4 <- as.data.frame(table(do.call(c, lapply(item_names, function(x) 
unlist(strsplit(x, c(" ")))))))
View(df4)

The base R code above does not seem to work either, since it still fails to remove punctuation such as ().

In the end, I want to remove all punctuation except +, ., -, / and generate the word frequencies using either of the two approaches above.

Any help would be greatly appreciated.

1 answer:

Answer 0: (score: 2)

Taking this example:

item_names <- c(
  "BOYS S SLV MOCK LAYER TEE",
  "BOYS S SLV PRINTED TEE",
  "CHEAP MONDAY TEE (SAD TOP)",
  "LOPPAN S SLV TEE (STRIPE)",
  "FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE",
  "LST-[REVISED]",
  "(lot of round and square brackets that I would like to get rid [of]. )"
)

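We can capture the four characters we want to keep and put them back, while every other punctuation character is replaced with nothing: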

gsub("([-\\.\\+\\/])|[[:punct:]]", "\\1", item_names)
[1] "BOYS S SLV MOCK LAYER TEE"                                         
[2] "BOYS S SLV PRINTED TEE"                                            
[3] "CHEAP MONDAY TEE SAD TOP"                                          
[4] "LOPPAN S SLV TEE STRIPE"                                           
[5] "FREE PRINTED SLV LESS TEE-ZEBRALOGO  SNAKE"                        
[6] "LST-REVISED"                                                       
[7] "lot of round and square brackets that I would like to get rid of. "
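To get from the cleaned strings to the word frequencies asked for in the question, one possible base R continuation is sketched below. It assumes the first six sample items, lower-cases them first (as the question's own attempts do), and splits on whitespace; the names cleaned, cleaned_alt, words, freq and df4 are only illustrative. The question's own pattern most likely fails because POSIX bracket expressions do not nest: the inner [ is treated as an ordinary member of the set and the first ] closes it, so the pattern only matches an unwanted character that is immediately followed by a literal ]. A flat negated class with the hyphen placed last avoids that.

```r
# Sample data from the question (first six items only).
item_names <- c(
  "BOYS S SLV MOCK LAYER TEE",
  "BOYS S SLV PRINTED TEE",
  "CHEAP MONDAY TEE (SAD TOP)",
  "LOPPAN S SLV TEE (STRIPE)",
  "FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE",
  "LST-[REVISED]"
)

# Keep - . + / (captured and re-inserted), drop all other punctuation.
cleaned <- gsub("([-\\.\\+\\/])|[[:punct:]]", "\\1", tolower(item_names))

# Alternative (an assumption, not part of the answer above): one flat
# negated class. The hyphen goes last so it is read as a literal.
# On this sample it gives the same strings as `cleaned`.
cleaned_alt <- gsub("[^[:alnum:][:space:]+./-]", "", tolower(item_names))

# Split on runs of whitespace and drop any empty tokens.
words <- unlist(strsplit(trimws(cleaned), "\\s+"))
words <- words[nzchar(words)]

# Count each unique word and sort by descending frequency.
freq <- sort(table(words), decreasing = TRUE)
df4 <- data.frame(
  word = names(freq),
  freq = as.vector(freq),
  stringsAsFactors = FALSE
)
```

With this sample, tee and slv each come out with a count of 4, s with 3, and tee-zebralogo stays a single token because the hyphen is kept.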