Question

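My question is simple. The quanteda package in R can compute term frequency (tf) for a document-feature matrix (dfm). When you look up the tf function with ?tf, the description says it takes four arguments. My question is about the 'scheme' argument. I don't understand how to use the maxCount option, i.e. using the maximum feature count per document as the divisor for tf normalization. Under 'Usage', the only options listed for the scheme argument are "count", "prop", "propmax", "boolean", "log", "augmented", and "logave". So how do I use the maxCount option?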

Answer 1

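The short answer is that this is a "bug" in the documentation (for quanteda 0.9.8.0-0.9.8.2): the option was removed from the function but not from the documentation. The replacement is scheme = "propmax". For example: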

txt <- c(doc1 = "This is a simple, simple, simple document.",
         doc2 = "This document is a second document.")
(myDfm <- dfm(txt, verbose = FALSE))
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs   this is a simple document second
##   doc1    1  1 1      3        1      0
##   doc2    1  1 1      0        2      1

Applying the weights:

tf(myDfm, scheme = "prop")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a    simple  document    second
##   doc1 0.1428571 0.1428571 0.1428571 0.4285714 0.1428571 0        
##   doc2 0.1666667 0.1666667 0.1666667 0         0.3333333 0.1666667
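
As a cross-check on the output above, the "prop" values can be reproduced without quanteda. This is a minimal base R sketch of the definition (each count divided by its document's total count), not the package's implementation. The matrix name counts is an assumption for illustration, and its values are typed in by hand from the dfm printed earlier.

```r
# Counts copied by hand from the dfm printed above (not produced by quanteda).
counts <- matrix(c(1, 1, 1, 3, 1, 0,
                   1, 1, 1, 0, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("doc1", "doc2"),
                                 features = c("this", "is", "a", "simple",
                                              "document", "second")))

# "prop": divide each count by the total feature count of its document.
# rowSums() gives one total per document; R recycles it down the rows.
prop <- counts / rowSums(counts)
round(prop, 7)
```

Each row of prop sums to 1, which is a quick way to confirm that the division was applied per document rather than per feature.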

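propmax is supposed to compute each count as a proportion of the most frequent count in the document. For example, in doc1 the maximum feature count is 3, so every term count in that document is divided by 3. In quanteda <= 0.9.8.2, however, a bug causes it to be computed incorrectly: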

tf(myDfm, scheme = "propmax")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a simple  document    second
##   doc1 1.0000000 1.0000000 1.0000000      3 1.0000000 0        
##   doc2 0.3333333 0.3333333 0.3333333      0 0.6666667 0.3333333

This is fixed in quanteda v0.9.8.3:

tf(myDfm, scheme = "propmax")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a simple  document second
##   doc1 0.3333333 0.3333333 0.3333333      1 0.3333333    0  
##   doc2 0.5000000 0.5000000 0.5000000      0 1.0000000    0.5

Note: this has been fixed in 0.9.8.3.
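
The corrected "propmax" values can also be verified by hand. The sketch below uses base R only and illustrates the definition given in this answer (divide each count by the largest count in the same document). It is not quanteda's own code, and the names counts and max_count are assumptions for illustration, with the counts entered by hand from the dfm output.

```r
# Counts copied by hand from the dfm printed in this answer.
counts <- matrix(c(1, 1, 1, 3, 1, 0,
                   1, 1, 1, 0, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("doc1", "doc2"),
                                 features = c("this", "is", "a", "simple",
                                              "document", "second")))

# Largest feature count in each document: 3 for doc1, 2 for doc2.
max_count <- apply(counts, 1, max)

# "propmax": divide each count by its document's maximum count.
propmax <- counts / max_count
round(propmax, 7)
```

With a correct implementation, the most frequent feature in each document gets a weight of exactly 1, which is what the v0.9.8.3 output shows for "simple" in doc1 and "document" in doc2.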
