vocab
wordIDx V1
1 archive
2 name
3 atheism
4 resources
5 alt
classifier_3
wordIDx newsgroup_ID docIdx word/doc totalwords/doc totalwords/newsgroup wordID/newsgroup P(W_j)
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912
For each wordIDx in vocab, I need to compute the following formula. For example, for wordIDx = 1 the value should be
max(log(0.02813027)+sum(log(2/47821),log(2/47821)))
= -23.73506
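The arithmetic of this example can be checked directly in R. This is only a small sketch: the variable names (p_w, word_count, total_words) are made up for illustration, and the numbers are copied by hand from the two wordIDx = 1 rows of the table.

```r
# Values taken from the two rows with wordIDx = 1 (hypothetical variable names).
p_w         <- 0.028130269       # P(W_j), identical in both rows
word_count  <- c(2, 2)           # wordID/newsgroup, one entry per row
total_words <- c(47821, 47821)   # totalwords/newsgroup, one entry per row

# One log term per row, summed, plus the log of P(W_j).
value <- log(p_w) + sum(log(word_count / total_words))
print(value)   # approximately -23.735
```

The sum has two terms only because wordIDx = 1 happens to have two rows; a word with one row would contribute a single term.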
I currently have the following code:
classifier_3$ans<- max(log(classifier_3$`P(W_j)`)+ (sum(log(classifier_3$`wordID/newsgroup`/classifier_3$`totalwords/newsgroup`))))
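How can I loop over this so that every wordIDx in the vocab data frame is covered and the value shown in the example above is computed for each one?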
Answer 0 (score: 1)
Something like this, but you really do need to clean up the column names.
vocab <- read.table(text = "wordIDx V1
1 archive
2 name
3 atheism
4 resources
5 alt", header = TRUE, stringsAsFactors = FALSE)
classifier_3 <- read.table(text = "wordIDx newsgroup_ID docIdx word/doc totalwords/doc totalwords/newsgroup wordID/newsgroup P(W_j)
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912", header = TRUE, stringsAsFactors = FALSE)
# Keep one row per word. This assumes every word has exactly two rows,
# as wordIDx = 1 does in the example.
classifier_3 <- classifier_3[!duplicated(classifier_3$wordIDx), ]
classifier_3 <- merge(vocab, classifier_3, by = "wordIDx")
# read.table() has rewritten the headers as syntactic names, e.g. P(W_j) is now P.W_j.
log_ratio <- log(classifier_3$wordID.newsgroup / classifier_3$totalwords.newsgroup)
# The example sums two identical log terms, i.e. 2 * log_ratio.
classifier_3$ans <- log(classifier_3$P.W_j.) + 2 * log_ratio
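The snippet above hard-codes two log terms, which matches wordIDx = 1 but not words that appear in one row only. A more general version might look like the sketch below. It rests on assumptions that the question does not confirm: the sum is taken over all rows of a word within one newsgroup, the max is taken across newsgroups, and the cleaned column names (word_doc, p_w and so on) are invented here for readability.

```r
# Sample data from the question, with hypothetical syntactic column names.
vocab <- data.frame(
  wordIDx = 1:5,
  V1 = c("archive", "name", "atheism", "resources", "alt"),
  stringsAsFactors = FALSE
)
classifier <- read.table(text = "wordIDx newsgroup_ID docIdx word_doc totalwords_doc totalwords_newsgroup wordID_newsgroup p_w
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912", header = TRUE)

# One log term per row.
classifier$log_ratio <- log(classifier$wordID_newsgroup / classifier$totalwords_newsgroup)

# Sum the log terms within each (word, newsgroup) pair; ave() returns the
# group total on every row of the group.
classifier$sum_log_ratio <- ave(classifier$log_ratio,
                                classifier$wordIDx, classifier$newsgroup_ID,
                                FUN = sum)
classifier$score <- log(classifier$p_w) + classifier$sum_log_ratio

# Assumed meaning of max(): best score per word across newsgroups.
best <- aggregate(score ~ wordIDx, data = classifier, FUN = max)

# Left join so that vocab words with no rows (wordIDx = 4 here) get NA.
result <- merge(vocab, best, by = "wordIDx", all.x = TRUE)
print(result)
```

No explicit loop is needed: ave() and aggregate() do the per-word grouping, and the left merge keeps every entry of vocab in the result.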