I want to remove all punctuation from a character vector except +, ., -, /
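I know similar questions exist, but I tried their solutions and did not get the result I wanted.
The current character vector item contains many round and square brackets that I would like to remove.
Here is a sample of the item variable: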
item
BOYS S SLV MOCK LAYER TEE
BOYS S SLV PRINTED TEE
CHEAP MONDAY TEE (SAD TOP)
LOPPAN S SLV TEE (STRIPE)
FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE
LST-[REVISED]
Ultimately, I want to produce a frequency count of the unique words in the item variable:
word freq
boys 2
s 3
slv 4
tee 4
tee-zebralogo 1
mock 1
layer 1
printed 2
cheap 1
... ...
Here is my current code, which uses the tm package:
library(tm)
item_names <- df1$item
item_names <- tolower(item_names)
item_names <- removePunctuation(item_names)
myCorpus <- Corpus(VectorSource(item_names))
myTDM <- TermDocumentMatrix(myCorpus)
findFreqTerms(myTDM)
m <- as.matrix(myTDM)
v <- sort(rowSums(m),decreasing=TRUE)
df4 <- data.frame(word = names(v),freq=v)
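With the code above I can remove all punctuation, but I want to keep the four punctuation characters listed above, and I have not managed to do that satisfactorily.
I also tried base R functions: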
item_names <- df1$item
item_names <- tolower(item_names)
item_names <- gsub(pattern = "[^[:alnum:][:space:][-\\.\\+\\/]]", "",
item_names)
item_names <- gsub(pattern = "\\s+", " ", item_names)
table(do.call(c, lapply(item_names, function(x) unlist(strsplit(x, " ")))))
df4 <- as.data.frame(table(do.call(c, lapply(item_names, function(x)
unlist(strsplit(x, c(" ")))))))
View(df4)
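The base R code above does not seem to work, because it still fails to remove punctuation such as ( and ).
In the end, I want to remove all punctuation except +, ., -, / and generate word frequencies with both of the approaches above.
Any help would be greatly appreciated.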
Answer 0 (score: 2)
Take this example:
item_names <- c(
"BOYS S SLV MOCK LAYER TEE",
"BOYS S SLV PRINTED TEE",
"CHEAP MONDAY TEE (SAD TOP)",
"LOPPAN S SLV TEE (STRIPE)",
"FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE",
"LST-[REVISED]",
"(lot of round and square brackets that I would like to get rid [of]. )"
)
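We can do the following. The pattern captures any of -, ., +, / in group 1 and otherwise matches any punctuation character; replacing each match with \\1 puts a captured character back and drops everything else: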
gsub("([-\\.\\+\\/])|[[:punct:]]", "\\1", item_names)
[1] "BOYS S SLV MOCK LAYER TEE"
[2] "BOYS S SLV PRINTED TEE"
[3] "CHEAP MONDAY TEE SAD TOP"
[4] "LOPPAN S SLV TEE STRIPE"
[5] "FREE PRINTED SLV LESS TEE-ZEBRALOGO SNAKE"
[6] "LST-REVISED"
[7] "lot of round and square brackets that I would like to get rid of. "
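To get from the cleaned strings to the word frequencies the question asks for, the same gsub call can be combined with strsplit and table. The following is only a minimal sketch in base R: it reuses the six sample items from the question, and the names cleaned, words and freq are illustrative, not from the original post.

```r
# Sample items copied from the question.
item_names <- c(
  "BOYS S SLV MOCK LAYER TEE",
  "BOYS S SLV PRINTED TEE",
  "CHEAP MONDAY TEE (SAD TOP)",
  "LOPPAN S SLV TEE (STRIPE)",
  "FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE",
  "LST-[REVISED]"
)

# Lower-case first, then keep -, ., +, / (group 1) and drop all other punctuation.
cleaned <- gsub("([-\\.\\+\\/])|[[:punct:]]", "\\1", tolower(item_names))

# Split on runs of whitespace and discard any empty tokens.
words <- unlist(strsplit(cleaned, "\\s+"))
words <- words[nzchar(words)]

# Count each distinct word and sort by decreasing frequency.
freq <- sort(table(words), decreasing = TRUE)
df4 <- data.frame(
  word = names(freq),
  freq = as.integer(freq),
  stringsAsFactors = FALSE
)
```

Splitting on "\\s+" instead of a single space matters here, because removing & from the fifth item leaves two consecutive spaces. With the sample data, df4 should list tee and slv with a count of 4, s with 3, and tee-zebralogo as a separate word with 1, in line with the table in the question.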