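I have a data frame containing strings, and I want to remove stop words from them. I am trying to avoid the tm package for the removal itself because the dataset is large and tm seems to run rather slowly. I am only using tm's stopword dictionary.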
library(plyr)
library(tm)
stopWords <- stopwords("en")
class(stopWords)
df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."
head(df1)
df1$string1 <- tolower(df1$string1)
str1 <- strsplit(df1$string1[5], " ")
> !(str1 %in% stopWords)
[1] TRUE
This is not the answer I am looking for. I am trying to get a vector or string of the words that are NOT in the stopWords vector.
What am I doing wrong?
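Answer 0 (score: 12)
You are not accessing the list properly, and you are not extracting the elements using the result of %in% (which gives a logical vector of TRUE / FALSE). strsplit() returns a list, so you have to unlist it or index into it first. You should do something like this: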
unlist(str1)[!(unlist(str1) %in% stopWords)]
(or)
str1[[1]][!(str1[[1]] %in% stopWords)]
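To make the difference concrete, here is a minimal sketch of why the original call returned a single TRUE. It assumes a small, hypothetical stop_words vector in place of tm's dictionary so that it runs without tm installed.

```r
# Hypothetical, abbreviated stop word vector so the sketch runs without tm.
stop_words <- c("this", "is", "a", "the", "of", "all", "other")

str1 <- strsplit("this string is a string.", " ")

class(str1)    # "list": strsplit() returns one list element per input string
length(str1)   # 1

# %in% is applied to the list itself, so only one comparison is made.
whole_list <- !(str1 %in% stop_words)
length(whole_list)   # 1

# Extracting the character vector first compares word by word.
words <- str1[[1]]
keep <- !(words %in% stop_words)
words[keep]
```

Indexing with [[1]] and calling unlist() are equivalent here because str1 has only one element; for several input strings, unlist() would merge all words into one vector.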
For the whole data.frame df1, you could do the following:
'%nin%' <- Negate('%in%')
lapply(df1[,2], function(x) {
t <- unlist(strsplit(x, " "))
t[t %nin% stopWords]
})
# [[1]]
# [1] "string" "string."
#
# [[2]]
# [1] "string" "slightly" "string."
#
# [[3]]
# [1] "string" "string."
#
# [[4]]
# [1] "string" "slightly" "shorter" "string."
#
# [[5]]
# [1] "string" "string" "strings."
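The lapply call above returns a list of word vectors. If the goal is a cleaned string column instead, the words can be pasted back together. The following is only a sketch under stated assumptions: stop_words is a hypothetical, abbreviated stand-in for tm's dictionary, and the column name cleaned is made up for illustration.

```r
# Hypothetical, abbreviated stop word vector; substitute stopwords("en") in practice.
stop_words <- c("this", "is", "a", "an", "the", "of", "all", "other")

df1 <- data.frame(
  id = 1:2,
  string1 = c("this string is a string.",
              "this string is the longest string of all the other strings."),
  stringsAsFactors = FALSE
)

# Split each row once, drop stop words, and glue the survivors back together.
tokens <- strsplit(df1$string1, " ", fixed = TRUE)
df1$cleaned <- vapply(
  tokens,
  function(w) paste(w[!(w %in% stop_words)], collapse = " "),
  character(1)
)

df1$cleaned
```

vapply with character(1) guarantees one string per row, so the result can be assigned straight to a column.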
Answer 1 (score: 2)
First, you should unlist str1 (or use lapply if str1 was built from a vector of strings):
!(unlist(str1) %in% words)
#> [1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
Second, a more complex solution:
string <- c("This string is a string.",
"This string is a slightly longer string.",
"This string is an even longer string.",
"This string is a slightly shorter string.",
"This string is the longest string of all the other strings.")
rm_words <- function(string, words) {
stopifnot(is.character(string), is.character(words))
spltted <- strsplit(string, " ", fixed = TRUE) # fixed = TRUE for speedup
vapply(spltted, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}
rm_words(string, tm::stopwords("en"))
#> [1] "string string." "string slightly longer string." "string even longer string."
#> [4] "string slightly shorter string." "string longest string strings."
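Note that the tokens in the output above still carry their punctuation ("string."), so a stop word directly followed by a period would not match the dictionary. One possible variation, sketched here with a hypothetical function name rm_words_punct and a hypothetical stop_words vector, compares a punctuation-free copy of each token while keeping the original tokens in the result.

```r
# Hypothetical stop word vector for illustration.
stop_words <- c("this", "is", "a", "it")

rm_words_punct <- function(string, words) {
  stopifnot(is.character(string), is.character(words))
  tokens <- strsplit(string, " ", fixed = TRUE)
  vapply(tokens, function(x) {
    # Compare a lowercased, punctuation-free copy; keep the original tokens.
    bare <- gsub("[[:punct:]]", "", tolower(x))
    paste(x[!(bare %in% words)], collapse = " ")
  }, character(1))
}

rm_words_punct("This is it.", stop_words)        # every token is a stop word
rm_words_punct("Keep this string.", stop_words)
```

This is a trade-off rather than a drop-in improvement: stripping punctuation also removes apostrophes, so dictionary entries that contain one (contractions, for example) would no longer match unless the dictionary is cleaned the same way.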
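Answer 2 (score: 0)
I came across this question while working on something similar.
Although it has already been answered, I wanted to add a concise one-liner that solved my problem. It removes all stop words directly in the data frame: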
df1$string1 <- unlist(lapply(df1$string1, function(x) {paste(unlist(strsplit(x, " "))[!(unlist(strsplit(x, " ")) %in% stopWords)], collapse=" ")}))