I have a vector of words and a vector of comments:
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
"glad I scheduled an appointment.",
"the staff have become more cordial.",
"the experience i had was not good at all.",
"i am very glad")
I want to create a data frame that looks like this:
df <- data.frame(comments = c("very good experience. first time I have been and I would definitely come back.",
"glad I scheduled an appointment.",
"the staff have become more cordial.",
"the experience i had was not good at all.",
"i am very glad"),
very = c(1,0,0,0,1),
glad = c(0,1,0,0,1),
experience = c(1,0,0,1,0))
I have more than 12,000 comments and 20 words to do this for. How can I do it efficiently? With for loops? Is there another way?
Answer 0 (score: 3)
One approach is to combine the stringi and qdapTools packages, i.e.
library(stringi)
library(qdapTools)
mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|')))
# experience glad very
#1 1 0 1
#2 0 1 0
#3 0 0 0
#4 1 0 0
#5 0 1 1
You can then use cbind or data.frame to bind the result to the comments:
cbind(comments, mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|'))))
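If installing extra packages is not an option, the same idea (extract every match of the combined pattern, then tabulate per comment) can be sketched in base R. This is only an illustrative equivalent, not how qdapTools does it internally; the names pattern, matches and counts are made up for the example. Note that it counts occurrences, so a word that appears twice in one comment yields 2 rather than 1.

```r
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
              "glad I scheduled an appointment.",
              "the staff have become more cordial.",
              "the experience i had was not good at all.",
              "i am very glad")

# One alternation pattern covering every word of interest
pattern <- paste(word.list, collapse = "|")

# List with one character vector of matches per comment (empty if none)
matches <- regmatches(comments, gregexpr(pattern, comments))

# Tabulate each comment's matches against the full set of words,
# so words that do not occur still get a 0
counts <- t(vapply(
  matches,
  function(m) as.integer(table(factor(m, levels = word.list))),
  integer(length(word.list))
))
colnames(counts) <- word.list

result <- data.frame(comments, counts, stringsAsFactors = FALSE)
result
```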
Answer 1 (score: 2)
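Using base R, this code loops over the word list and over each comment, checks whether each word is present in the split comment (split on spaces, periods and commas), and then recombines the results into a data frame: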
df <- as.data.frame(do.call(cbind,lapply(word.list,function(w)
as.numeric(sapply(comments,function(v) w %in% unlist(strsplit(v,"[ \\.,]")))))))
names(df) <- word.list
df <- cbind(comments,df)
df
comments very experience glad
1 very good experience. first time I have been and I would definitely come back. 1 1 0
2 glad I scheduled an appointment. 0 0 1
3 the staff have become more cordial. 0 0 0
4 the experience i had was not good at all. 0 1 0
5 i am very glad 1 0 1
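With 12,000 comments and 20 words, the code above re-splits every comment once per word. A possible variation, sketched here under the assumption that splitting on spaces, periods and commas is enough, tokenizes each comment once and reuses the tokens for every word. The names tokens and flags are illustrative only.

```r
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
              "glad I scheduled an appointment.",
              "the staff have become more cordial.",
              "the experience i had was not good at all.",
              "i am very glad")

# Split every comment once; tokens is a list of character vectors
tokens <- strsplit(comments, "[ .,]+")

# For each word, test membership in each comment's token vector
flags <- lapply(word.list, function(w) {
  as.integer(vapply(tokens, function(tk) w %in% tk, logical(1)))
})
names(flags) <- word.list

df <- data.frame(comments, as.data.frame(flags), stringsAsFactors = FALSE)
df
```

Because the comparison is on whole tokens, "very" does not match a longer token such as "veryX".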
Answer 2 (score: 2)
Loop over word.list and use grepl:
sapply(word.list, function(i) as.numeric(grepl(i, comments)))
To get nicer output, convert to a data frame:
data.frame(comments, sapply(word.list, function(i) as.numeric(grepl(i, comments))))
Note: grepl will match "very" inside "veryX". If that is not what you want, you need complete word matching.
# To avoid matching "very" with "veryX"
sapply(word.list, function(i) as.numeric(grepl(paste0("\\b", i, "\\b"), comments)))
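Putting the two pieces of this answer together, a minimal sketch of a reusable function might look like the following. flag_words is a hypothetical helper name, not part of any package, and it assumes the words contain no regex metacharacters.

```r
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
              "glad I scheduled an appointment.",
              "the staff have become more cordial.",
              "the experience i had was not good at all.",
              "i am very glad")

# Hypothetical helper: one 0/1 column per word, whole-word matches only
flag_words <- function(words, texts) {
  flags <- lapply(words, function(w) {
    # \\b anchors the word at word boundaries on both sides
    as.integer(grepl(paste0("\\b", w, "\\b"), texts))
  })
  names(flags) <- words
  data.frame(comments = texts, as.data.frame(flags), stringsAsFactors = FALSE)
}

result <- flag_words(word.list, comments)
result
```

grepl is vectorized over the comments, so the only explicit loop is over the 20 or so words.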