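I am new to text mining in R. I want to remove stop words from a column of my data frame (that is, extract the keywords) and put those keywords into a new column.
I tried building a corpus, but that did not help.
df$C3 is what I currently have. I want to add the column df$C4, but I can't get it to work.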
df <- structure(list(C3 = structure(c(3L, 4L, 1L, 7L, 6L, 9L, 5L, 8L,
10L, 2L), .Label = c("Are doing good", "For the help", "hello everyone",
"hope you all", "I Hope", "I need help", "In life", "It would work",
"On Text-Mining", "Thanks"), class = "factor"), C4 = structure(c(2L,
4L, 1L, 6L, 3L, 7L, 5L, 9L, 8L, 3L), .Label = c("doing good",
"everyone", "help", "hope", "Hope", "life", "Text-Mining", "Thanks",
"work"), class = "factor")), .Names = c("C3", "C4"), row.names = c(NA,
-10L), class = "data.frame")
head(df)
# C3 C4
# 1 hello everyone everyone
# 2 hope you all hope
# 3 Are doing good doing good
# 4 In life life
# 5 I need help help
# 6 On Text-Mining Text-Mining
Answer 0 (score: 0)
This is the first thing I have done in R, so it may not be the best approach, but something like:
library(stringi)
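# For each word in stop$stop, keep the rows of df whose third column contains it
# (df and stop are defined in the sample data below)
df2 <- do.call(rbind, lapply(stop$stop, function(x) {
  t <- data.frame(c1 = df[, 1], c2 = df[, 2], words = stri_extract(df[, 3], coll = x))
  na.omit(t)
}))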
Sample data:
df = data.frame(c1 = c(108,20,99), c2 = c(1,3,7), c3 = c("hello everyone", "hope you all", "are doing well"))
stop = data.frame(stop = c("you", "all"))
You can then reshape df2 with:
df2 = data.frame(c1 = unique(df2$c1), c2 = unique(df2$c2), words = paste(df2$words, collapse = ','))
Then cbind df and df2.
Answer 1 (score: 0)
This solution uses the packages dplyr and tidytext.
library(dplyr)
library(tidytext)
# subset of your dataset
dt = data.frame(C1 = c(108,20, 999, 52, 400),
C2 = c(1,3,7, 6, 9),
C3 = c("hello everyone","hope you all","Are doing good","in life","I need help"), stringsAsFactors = F)
# function to combine words (by pasting one next to the other)
f = function(x) { paste(x, collapse = " ") }
dt %>%
unnest_tokens(word, C3) %>% # split phrases into words
filter(!word %in% stop_words$word) %>% # keep appropriate words
group_by(C1, C2) %>% # for each combination of C1 and C2
summarise(word = f(word)) %>% # combine multiple words (if there are multiple)
ungroup() # forget the grouping
# # A tibble: 2 x 3
# C1 C2 word
# <dbl> <dbl> <chr>
# 1 20 3 hope
# 2 52 6 life
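The catch is that the "stop words" built into that package filter out some of the words you want to keep. You therefore need an extra manual step that specifies the words to retain. You could do it like this: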
dt %>%
unnest_tokens(word, C3) %>% # split phrases into words
filter(!word %in% stop_words$word | word %in% c("everyone","doing","good")) %>% # keep appropriate words
group_by(C1, C2) %>% # for each combination of C1 and C2
summarise(word = f(word)) %>% # combine multiple words (if there are multiple)
ungroup() # forget the grouping
# # A tibble: 4 x 3
# C1 C2 word
# <dbl> <dbl> <chr>
# 1 20 3 hope
# 2 52 6 life
# 3 108 1 everyone
# 4 999 7 doing good
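If the same exceptions are needed in several places, one option is to build a custom stop-word table once and remove it with anti_join(). The following is only a sketch under that assumption: the names dt_demo, keep_words and custom_stop are made up for illustration, and the result still depends on the contents of the stop_words table shipped with tidytext.

```r
library(dplyr)
library(tidytext)

# Small illustrative data set (hypothetical name, same shape as dt above)
dt_demo <- data.frame(
  C1 = c(108, 20, 999),
  C2 = c(1, 3, 7),
  C3 = c("hello everyone", "hope you all", "Are doing good"),
  stringsAsFactors = FALSE
)

# Words to protect from removal (taken from the example above)
keep_words <- c("everyone", "doing", "good")

# Custom stop-word table: the built-in list minus the protected words
custom_stop <- stop_words %>% filter(!word %in% keep_words)

result <- dt_demo %>%
  unnest_tokens(word, C3) %>%                  # split phrases into lower-case words
  anti_join(custom_stop, by = "word") %>%      # drop every word in the custom list
  group_by(C1, C2) %>%
  summarise(word = paste(word, collapse = " "), .groups = "drop")

print(result)
```

This does the same filtering as the filter() call with the extra `|` condition; it just keeps the list of exceptions in one place.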
Answer 2 (score: 0)
I would use the tm package. It ships with a small dictionary of English stop words. You can remove them with gsub():
library(tm)
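# Lower-case the text and pad it with spaces so words at either end can match
prep <- tolower(paste(" ", df$C3, " "))
# Join the stop words with " | " so they match as space-delimited words
regex_pat <- paste(stopwords("en"), collapse = " | ")
df$C4 <- gsub(regex_pat, " ", prep)
# Second pass: catches stop words skipped because the previous match consumed their leading space
df$C4 <- gsub(regex_pat, " ", df$C4)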
# C3 C4
# 1 hello everyone hello everyone
# 2 hope you all hope
# 3 Are doing good good
# 4 In life life
# 5 I need help need help
You can easily add new words, e.g. c("hello", "othernewword", stopwords("en")).
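For comparison, here is a minimal base-R sketch of the same idea that splits each phrase into words instead of relying on space-padded regex matches. The helper remove_stopwords, the short my_stopwords vector and df_demo are hypothetical names used only for illustration; in practice the vector could be stopwords("en") from tm, extended as described above.

```r
# Hypothetical stop-word list for illustration; tm::stopwords("en") could be used instead
my_stopwords <- c("i", "you", "all", "are", "in", "on", "for", "the", "it", "would")

# Hypothetical helper: remove stop words from each element of a character vector
remove_stopwords <- function(x, stopwords) {
  tokens <- strsplit(as.character(x), "\\s+")  # as.character() also handles factor columns
  vapply(tokens, function(words) {
    keep <- !(tolower(words) %in% tolower(stopwords))  # case-insensitive comparison
    paste(words[keep], collapse = " ")
  }, character(1))
}

df_demo <- data.frame(
  C3 = c("hello everyone", "hope you all", "Are doing good", "In life", "I need help"),
  stringsAsFactors = FALSE
)
df_demo$C4 <- remove_stopwords(df_demo$C3, my_stopwords)
print(df_demo)
```

Because the comparison is done per word, a single pass is enough and the original capitalisation of the kept words is preserved.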