How do I use regex to create indicator variables based on whether words appear in comments?

Time: 2017-04-27 13:00:07

Tags: r regex grepl

I have a vector of words and a vector of comments:

word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

I would like to create a data frame that looks like this:

df <- data.frame(comments = c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad"),
               very = c(1,0,0,0,1),
               glad = c(0,1,0,0,1),
               experience = c(1,0,0,1,0))

I have more than 12,000 comments and 20 words that I want to do this for. How can I do this efficiently? With for loops? Is there another way?

3 Answers:

Answer 0: (score: 3)

One approach is to combine the stringi and qdapTools packages, i.e.

library(stringi)
library(qdapTools)

mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|')))
#  experience glad very
#1          1    0    1
#2          0    1    0
#3          0    0    0
#4          1    0    0
#5          0    1    1

You can then bind the result to the comments using cbind or data.frame,

cbind(comments, mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|'))))

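Answer 1: (score: 2)

Using base R, this code loops over the word list and over each comment, checks whether each word is present in the split comment (split on spaces, periods, and commas), and then recombines the results into a data frame: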

df <- as.data.frame(do.call(cbind,lapply(word.list,function(w) 
          as.numeric(sapply(comments,function(v) w %in% unlist(strsplit(v,"[ \\.,]")))))))
names(df) <- word.list
df <- cbind(comments,df)

df
                                                                        comments very experience glad
1 very good experience. first time I have been and I would definitely come back.    1          1    0
2                                               glad I scheduled an appointment.    0          0    1
3                                            the staff have become more cordial.    0          0    0
4                                      the experience i had was not good at all.    0          1    0
5                                                                 i am very glad    1          0    1
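
Since the question asks about efficiency, here is a variant that tokenizes each comment only once instead of once per word. It is a rough sketch rather than a benchmarked solution. The `tokens` and `indicators` names, the `tolower()` call, and the broader split pattern `[^[:alnum:]]+` are assumptions added for illustration and are not part of the answer above.

```r
word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

# Split every comment once on runs of non-alphanumeric characters.
# tolower() makes the lookup case-insensitive (an assumption; remove it to keep case).
tokens <- strsplit(tolower(comments), "[^[:alnum:]]+")

# For each comment, test which of the words occur among its tokens;
# rbind stacks the per-comment results into one row per comment.
indicators <- do.call(rbind, lapply(tokens, function(tok) as.numeric(word.list %in% tok)))
colnames(indicators) <- word.list

df <- data.frame(comments, indicators, stringsAsFactors = FALSE)
```

With 20 words and 12,000 comments this calls strsplit 12,000 times rather than 240,000 times; whether that matters in practice would need to be timed on the real data.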

Answer 2: (score: 2)

Loop over word.list and use grepl:

sapply(word.list, function(i) as.numeric(grepl(i, comments)))

For nicer output, convert to a data frame:

data.frame(comments, sapply(word.list, function(i) as.numeric(grepl(i, comments))))

Note: grepl will match "very" against "veryX". If that is not wanted, you need complete word matching:

# To avoid matching "very" with "veryX"
sapply(word.list, function(i) as.numeric(grepl(paste0("\\b", i, "\\b"), comments)))
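
The two steps of this answer can be combined into one sketch that uses whole-word matching and attaches the indicator columns to the comments. The `indicators` and `result` names, the use of `vapply` in place of `sapply`, and `ignore.case = TRUE` are assumptions added for illustration; drop `ignore.case` if matching should stay case-sensitive.

```r
word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

# One column per word. \\b anchors the pattern at word boundaries, so "very"
# does not match inside a longer token such as "veryX".
indicators <- vapply(
  word.list,
  function(w) as.numeric(grepl(paste0("\\b", w, "\\b"), comments, ignore.case = TRUE)),
  numeric(length(comments))
)

# vapply names the columns after word.list; data.frame keeps those names.
result <- data.frame(comments, indicators, stringsAsFactors = FALSE)
```

The words are pasted straight into a regular expression here, so a word containing regex metacharacters (for example a period) would need escaping first.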