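My goal is to do lexicon-based sentiment analysis in R.
I have two character vectors: one with positive words and one with negative words. For example: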
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")
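I now have thousands of news articles, and for each article I want to know how many elements of my vectors pos and neg appear in it.
For example (I am not sure how the Corpus function works here, but you get the idea: my corpus contains two articles):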
mycorpus <- Corpus("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")
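I would like to get something like this:
article 1: 2 elements of pos and 0 elements of neg
article 2: 0 elements of pos and 2 elements of neg
It would also be nice to get the following for each article:
(number of pos words - number of neg words) / (total number of words in the article)
Thank you very much!
Edit:
@Victorp: this does not seem to work.
The matrix looks fine to me: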
mytdm[1:6,1:10]
           Docs
Terms       1 2 3 4 5 6 7 8 9 10
  aaron     0 0 0 0 0 1 0 0 0  0
  abandon   1 1 0 0 0 0 0 0 0  0
  abandoned 0 0 0 3 0 0 0 0 0  0
  abbey     0 0 0 0 0 0 0 0 0  0
  abbott    0 0 0 0 0 0 0 0 0  0
  abbotts   0 0 1 0 0 0 0 0 0  0
But when I run your command, I get zero for every document:
colSums(mytdm[rownames(mytdm) %in% pos, ])
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Why is this happening?
Answer 0 (score: 1)
Hi, you can do this with a TermDocumentMatrix (pos and neg are the vectors from the question):
library(tm)
mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))
mytdm <- TermDocumentMatrix(mycorpus, control=list(removePunctuation=TRUE))
mytdm <- as.matrix(mytdm)
# Positive words
# drop = FALSE keeps the subset a matrix even if only one term matches
colSums(mytdm[rownames(mytdm) %in% pos, , drop = FALSE])
1 2
2 0
# Negative words
colSums(mytdm[rownames(mytdm) %in% neg, , drop = FALSE])
1 2
0 2
# Total number of words per document
# (by default tm drops terms shorter than 3 characters, so "is" and "a" are not counted)
colSums(mytdm)
1 2
9 5
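The three counts above can be combined into the score the question asks for, (pos - neg) / total. The following is a minimal sketch and not part of the original answer: it builds a small matrix by hand that stands in for as.matrix(mytdm) so that it runs without tm, and the helper name count_hits is hypothetical.

```r
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

# Hand-built stand-in for as.matrix(mytdm): rows are terms, columns are documents.
terms <- c("case", "ceo", "finally", "happy", "that", "the", "they", "won",
           "caused", "disaster", "huge", "loss")
m <- matrix(0, nrow = length(terms), ncol = 2,
            dimnames = list(Terms = terms, Docs = c("1", "2")))
m[c("case", "ceo", "finally", "happy", "that", "they", "won"), "1"] <- 1
m["the", "1"] <- 2
m[c("the", "caused", "disaster", "huge", "loss"), "2"] <- 1

# Hypothetical helper: per-document count of terms that appear in a lexicon.
# drop = FALSE keeps a matrix even when only one term matches.
count_hits <- function(mat, lexicon) {
  colSums(mat[rownames(mat) %in% lexicon, , drop = FALSE])
}

n_pos   <- count_hits(m, pos)
n_neg   <- count_hits(m, neg)
n_total <- colSums(m)

# Score requested in the question: (pos - neg) / total words, per document.
score <- (n_pos - n_neg) / n_total
print(data.frame(pos = n_pos, neg = n_neg, total = n_total, score = score))
```

If no row name matches the lexicon at all, the subset has zero rows and colSums returns 0 for every document. That is one possible reason for all-zero output, and intersect(rownames(mytdm), pos) shows which lexicon terms are actually present in the matrix.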
Answer 1 (score: 1)
Here is another approach:
## pos <- c("good", "accomplished", "won", "happy")
## neg <- c("bad", "loss", "damaged", "sued", "disaster")
##
## mycorpus <- Corpus(VectorSource(
## list("The CEO is happy that they finally won the case.",
## "The disaster caused a huge loss.")))
library(qdap)
with(tm_corpus2df(mycorpus), termco(text, docs, list(pos=pos, neg=neg)))
## docs word.count pos neg
## 1 1 10 2(20.00%) 0
## 2 2 6 0 2(33.33%)
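For comparison, the same counts can be sketched in base R without tm or qdap. This is an illustration under simple assumptions rather than either answerer's method: text is lowercased, anything that is not a letter is treated as a separator, and the function name lexicon_score is made up for this example.

```r
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

articles <- c("The CEO is happy that they finally won the case.",
              "The disaster caused a huge loss.")

# Hypothetical helper: word count, lexicon hits and score for one text.
lexicon_score <- function(text, pos, neg) {
  # Lowercase, then split on runs of non-letter characters.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by leading separators
  n_pos <- sum(words %in% pos)
  n_neg <- sum(words %in% neg)
  data.frame(words = length(words), pos = n_pos, neg = n_neg,
             score = (n_pos - n_neg) / length(words))
}

# One row per article.
result <- do.call(rbind, lapply(articles, lexicon_score, pos = pos, neg = neg))
print(result)
```

This sketch counts every word, so the totals are 10 and 6, the same as word.count in the qdap output, whereas the tm approach reports 9 and 5 because it skips very short terms by default. Which denominator is appropriate for the score depends on the analysis.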