I am using the R libraries tm and qdap to count words in a given text. When my vector (words) contains only a few words, everything looks fine:
library(tm)
library(qdap)
text <- "activat affect affected affecting affects aggravat allow attribut based basis
bc because bosses caus change changed changes changing compel compliance"
text <- Corpus(VectorSource(text))
words <- c("activat", "affect", "affected")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Results:
# docs word.count activat affect affected
# 1 doc 1 20 1(5.00%) 4(20.00%) 1(5.00%)
But when my vector (words) contains too many words, the printed result wraps across several lines and becomes unreadable:
words <- c("activat", "affect", "affected", "affecting", "affects", "aggravat", "allow",
"attribut", "based", "basis", "bc", "because", "bosses", "caus", "change",
"changed", "changes", "changing", "compel", "compliance")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Results:
# docs word.count activat affect affected affecting affects aggravat allow
# attribut based basis bc because bosses caus change changed
# changes changing compel compliance
# 1 doc 1 20 1(5.00%) 4(20.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%)
# 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 2(10.00%) 3(15.00%) 1(5.00%)
# 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%)
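How can I display the results in a data frame or matrix so that they are easier to read?
I tried using termco2mat (from the qdap library), which is said to "return a matrix of term counts" (https://trinker.github.io/qdap/termco.html), as shown below, but I got an error message: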
apply_as_df(text, termco2mat, match.list=words)
# Results:
# Error in qdapfun(text.var = text, ...) :
# unused arguments (text.var = text, match.list = c("activat", "affect", "affected",
# "affecting", "affects", "aggravat", "allow", "attribut", "based", "basis", "bc",
# "because", "bosses", "caus", "change", "changed", "changes", "changing", "compel",
# "compliance"))
Or:
termco2mat(apply_as_df(text, termco, match.list=words))
# Results:
# Error in `rownames<-`(`*tmp*`, value = "doc 1") :
# attempt to set 'rownames' on an object with no dimensions
Answer 0 (score: 0)
Here is a solution without qdap:
library(tm)
text1 <- "activat affect affected affecting affects aggravat allow attribut"
text2 <- "based basis bc because bosses caus change changed changes changing compel compliance"
text <- Corpus(VectorSource(c(text1, text2)))
words <- c("activat", "affect", "affected")
dtm <- DocumentTermMatrix(text)
data.frame(cnt = colSums(as.matrix(dtm[ , words])))
Output:
cnt
activat 1
affect 1
affected 1
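Note that DocumentTermMatrix counts whole tokens, so affect comes out as 1 here, whereas termco reported 4 for it in the question, which suggests that termco matches substrings. By default tm also drops terms shorter than three characters, so a term such as bc would not appear as a column. If substring counts are what is wanted, the following is a minimal base R sketch, not the qdap implementation; the helper name count_substring and the shortened words vector are assumptions made for illustration.

```r
# Base R sketch: count fixed-substring occurrences of each term in one string.
# count_substring() is a hypothetical helper, not part of tm or qdap.
txt <- paste(
  "activat affect affected affecting affects aggravat allow attribut based basis",
  "bc because bosses caus change changed changes changing compel compliance"
)
words <- c("activat", "affect", "affected", "bc", "caus", "change")

count_substring <- function(term, string) {
  hits <- gregexpr(term, string, fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr() returns -1 when there is no match
}

# One row per term, which stays readable however many terms there are
counts_long <- data.frame(
  term  = words,
  count = vapply(words, function(w) count_substring(w, txt), integer(1)),
  row.names = NULL
)
counts_long
```

Because the result is in long format (one row per term), it does not wrap the way the wide termco printout does.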
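Answer 1 (score: 0)
I'm not sure what you are trying to do, but scores() and counts() are how you extract the objects from the list. Maybe you want t() to transpose the output?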
apply_as_df(text, termco, match.list=words) %>%
counts() %>%
t()
## docs "doc 1"
## word.count "20"
## activat "1"
## affect "4"
## affected "1"
## affecting "1"
## affects "1"
## aggravat "1"
## allow "1"
## attribut "1"
## based "1"
## basis "1"
## bc "1"
## because "1"
## bosses "1"
## caus "2"
## change "3"
## changed "1"
## changes "1"
## changing "1"
## compel "1"
## compliance "1"
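The transposed result above is a character matrix, because the docs column is text. If numeric counts are needed for further work, one option is to reshape the one-row table into a long data frame. The sketch below assumes that counts() returns a one-row data frame with docs and word.count columns followed by one column per term, as the printout suggests; the object wide is a hand-built stand-in with values copied from that output, not an actual qdap result.

```r
# Stand-in for the one-row table shown above (values copied from the printout).
# In real use this would be the data frame extracted with counts().
wide <- data.frame(docs = "doc 1", word.count = 20L,
                   activat = 1L, affect = 4L, affected = 1L)

# Every column except the document label and the total is a term count
term_cols <- setdiff(names(wide), c("docs", "word.count"))
term_counts <- as.integer(unlist(wide[1, term_cols]))

long <- data.frame(
  term    = term_cols,
  count   = term_counts,
  percent = 100 * term_counts / wide$word.count
)
long
```

This keeps the counts as integers and recomputes the percentages from word.count rather than parsing strings such as 4(20.00%).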