Question

I have a vector of words and a vector of comments:

word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

I want to create a data frame that looks like this:

df <- data.frame(comments = c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad"),
               very = c(1,0,0,0,1),
               glad = c(0,1,0,0,1),
               experience = c(1,0,0,1,0))

I have more than 12,000 comments and 20 words that I want to do this for. How can I do this efficiently? With for loops? Is there another way?

Answer 1

One way is to combine the stringi and qdapTools packages, i.e.

library(stringi)
library(qdapTools)

mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|')))
#  experience glad very
#1          1    0    1
#2          0    1    0
#3          0    0    0
#4          1    0    0
#5          0    1    1

You can then bind the result to the comments with cbind or data.frame:

cbind(comments, mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|'))))
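
A side note on the approach above: as far as I can tell, mtabulate tabulates every extracted match, so a comment that contains a word twice would get a 2 rather than a 1. If strict 0/1 indicators are wanted without extra packages, the same extract-then-tabulate idea can be sketched in base R. The names hits, counts and indicators are my own, and the sketch assumes the words contain no regex metacharacters.

```r
word.list <- c("very", "experience", "glad")
comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

# Extract every match of any word in a single pass over the comments
pattern <- paste(word.list, collapse = "|")
hits <- regmatches(comments, gregexpr(pattern, comments))

# Count the matches per word for each comment (one row per comment)
counts <- t(vapply(hits,
                   function(h) tabulate(match(h, word.list), nbins = length(word.list)),
                   integer(length(word.list))))
colnames(counts) <- word.list

# Cap the counts at 1 to get indicator columns, then attach the comments
indicators <- data.frame(comments, (counts > 0) * 1L)
```

Like the regex in the answer, this matches substrings, so "very" would also be found inside a longer word.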

Answer 2

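Using base R, this code loops over the word list and over each comment, checks whether each word is present in the split comment (split on spaces, periods and commas), and then reassembles the results into a data frame: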

df <- as.data.frame(do.call(cbind,lapply(word.list,function(w) 
          as.numeric(sapply(comments,function(v) w %in% unlist(strsplit(v,"[ \\.,]")))))))
names(df) <- word.list
df <- cbind(comments,df)

df
                                                                        comments very experience glad
1 very good experience. first time I have been and I would definitely come back.    1          1    0
2                                               glad I scheduled an appointment.    0          0    1
3                                            the staff have become more cordial.    0          0    0
4                                      the experience i had was not good at all.    0          1    0
5                                                                 i am very glad    1          0    1
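
The code above splits every comment again for each word. With 12,000+ comments and 20 words that may add up, so here is a rough sketch of a variant that splits each comment only once and then tests all words against its tokens. The names tokens, flags and df2 are my own, and I have not benchmarked it against the original.

```r
word.list <- c("very", "experience", "glad")
comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

# Split each comment once, on spaces, periods and commas
tokens <- strsplit(comments, "[ \\.,]")

# For each comment, flag which words appear among its tokens
flags <- t(vapply(tokens,
                  function(tok) as.numeric(word.list %in% tok),
                  numeric(length(word.list))))
colnames(flags) <- word.list

# Attach the comments to the indicator columns
df2 <- data.frame(comments, flags)
```

Because whole tokens are compared, this only counts complete words, same as the original code.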

Answer 3

Loop over word.list and use grepl:

sapply(word.list, function(i) as.numeric(grepl(i, comments)))

For nicer output, convert to a data frame:

data.frame(comments, sapply(word.list, function(i) as.numeric(grepl(i, comments))))

Note: grepl will match "very" against "veryX". If that is not what you want, you need complete word matching.

# To avoid matching "very" with "veryX"
sapply(word.list, function(i) as.numeric(grepl(paste0("\\b", i, "\\b"), comments)))
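
To see the difference between the two patterns, here is a small sketch on a made-up pair of comments (test.comments is hypothetical, not from the question). It runs both versions side by side and then builds the data frame from the word-boundary one.

```r
word.list <- c("very", "experience", "glad")

# Hypothetical comments: the second one contains "veryX" but not the word "very"
test.comments <- c("i am very glad", "veryX is not a word")

# Plain substring match: "very" is found inside "veryX"
plain <- sapply(word.list, function(i) as.numeric(grepl(i, test.comments)))

# Word-boundary match: only the complete word "very" counts
bounded <- sapply(word.list, function(i)
  as.numeric(grepl(paste0("\\b", i, "\\b"), test.comments)))

# Data frame with one indicator column per word
result <- data.frame(comments = test.comments, bounded)
```

Here plain flags "very" for both comments, while bounded flags it only for the first.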

How do I grep for words and create indicator variables based on their presence in comments?
