How is PcGw computed in quanteda's Naive Bayes?

Asked: 2019-02-07 03:38:58

Tags: r quanteda

Consider the usual toy example that replicates the worked example from section 13.1 of An Introduction to Information Retrieval (https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf):

txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
trainingset <- dfm(txt, tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
tmod1 <- textmodel_nb(trainingset, y = trainingclass, prior = "docfreq")

According to the documentation, PcGw is the

posterior class probability given the word

How is it computed? I would have thought that the quantity we care about is the other direction, i.e. PwGc, the probability of the word given the class.

Thanks!

1 Answer:

Answer 0 (score: 1)

The application is explained clearly in the book chapter you cite, but in essence the difference is that PcGw is the "probability of the class given the word", while PwGc is the "probability of the word given the class". The former is the posterior that we need, via the joint probability, to compute the probability of class membership for a set of words; in quanteda it is applied using the predict() function. The latter is just the likelihood, taken from the relative frequency of the features within each class, smoothed by default by adding one to the counts in each class.
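For illustration, here is a hedged sketch of the posterior in use. It assumes the current setup, in which textmodel_nb() and its predict() method live in the quanteda.textmodels package, and that the model was fitted under that package; type = "probability" returns P(class | document) for each document in the fitted dfm.

library(quanteda)
library(quanteda.textmodels)   # assumption: textmodel_nb()/predict() from this package
predict(tmod1, type = "probability")
# One row per document, one column per class: P(class | document),
# combining the word likelihoods PwGc with the docfreq prior.
# The unlabelled d5 should get the higher posterior for class "Y"
# (about 0.69), matching worked example 13.1 in the IIR book.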

You can verify this yourself if you like. First, group the training documents by training class, then smooth the counts:

trainingset_bygroup <- dfm_group(trainingset[1:4, ], trainingclass[-5]) %>%
    dfm_smooth(smoothing = 1)
trainingset_bygroup
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
#     features
# docs Chinese Beijing Shanghai Macao Tokyo Japan
#    N       2       1        1     1     2     2
#    Y       6       2        2     2     1     1

Then you can see that the (smoothed) word likelihoods are the same as PwGc: each row of counts is divided by its row total (9 for class N, 14 for class Y), the same denominators as in the book's worked example.

trainingset_bygroup / rowSums(trainingset_bygroup)
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
#     features
# docs   Chinese   Beijing  Shanghai     Macao      Tokyo      Japan
#    N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
#    Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857

tmod1$PwGc
#        features
# classes   Chinese   Beijing  Shanghai     Macao      Tokyo      Japan
#       N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
#       Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857

But you probably care most about P(class | word), since that is what Bayes' formula is all about, and it incorporates the prior class probabilities P(c).
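To make that concrete, here is a minimal sketch (not quanteda's internal code) that applies Bayes' rule, P(c | w) = P(w | c) P(c) / sum over c' of P(w | c') P(c'), one word at a time. It assumes the fitted object exposes the docfreq prior as tmod1$Pc, as the quanteda release used above did:

Pc <- tmod1$Pc                         # docfreq prior: P(N) = 0.25, P(Y) = 0.75
joint <- tmod1$PwGc * Pc               # P(w | c) * P(c), row i scaled by Pc[i]
PcGw_manual <- sweep(joint, 2, colSums(joint), "/")  # normalise within each word
PcGw_manual
# e.g. for "Chinese": P(Y | Chinese) = (3/7 * 3/4) / (3/7 * 3/4 + 2/9 * 1/4)
#                                    = 81/95 =~ 0.853
all.equal(PcGw_manual, tmod1$PcGw)     # should be TRUE

Because the docfreq prior puts P(Y) = 0.75 against P(N) = 0.25, the posteriors lean further towards Y than the raw likelihoods alone would suggest.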