Assigning different numeric weights to different terms in a quanteda dfm does not work

Time: 2016-10-04 17:40:55

Tags: r quanteda

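I am new to text analysis and am currently trying the quanteda package in R. I want to assign different numeric weights to certain specific terms and then test the accuracy of a model. I tried the approach described in another thread, "Assigning weights to different features in R", which keeps the dfm class intact, but I cannot get the correct output. Any help would be appreciated.

Here is what I tried: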

##install.packages("quanteda")
library(quanteda)
str <- c("apple is better than banana", "banana banana apple much better",
         "much much better new banana")

weights <- c(apple = 5, banana = 3, much = 0.5)
myDfm <- dfm(str, remove = stopwords("english"), verbose = FALSE)

#output
##Document-feature matrix of: 3 documents, 5 features.
##3 x 5 sparse Matrix of class "dfmSparse"
##   features
##docs    apple better banana much new
##text1     1      1      1    0   0
##text2     1      1      2    1   0
##text3     0      1      1    2   1

newweights <- weights[featnames(myDfm)]
# reassign 1 to non-matched NAs
newweights[is.na(newweights)] <- 1

# this does not work for me - see the output
myDfm * newweights

##output
##Document-feature matrix of: 3 documents, 5 features.
##3 x 5 sparse Matrix of class "dfmSparse"
##   features
##docs    apple better banana much new
##text1     5    0.5    1.0    0   0
##text2     1    1.0    6.0    5   0
##text3     0    5.0    0.5    2   1

Environment details

platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          2.2
year           2015
month          08
day            14
svn rev        69053
language       R
version.string R version 3.2.2 (2015-08-14)
nickname       Fire Safety

1 Answer:

Answer 0: (score: 0)

This appears to be related to the * operator in the Matrix package, on which the dfm class is built. This works:

> matrix(1:6, nrow = 3) * c(2, 3)
     [,1] [,2]
[1,]    2   12
[2,]    6   10
[3,]    6   18

But this does not:

> Matrix::Matrix(matrix(1:6, nrow = 3)) * c(2, 3)
Error in Matrix(matrix(1:6, nrow = 3)) * c(2, 3) : 
  length of 2nd arg does not match dimension of first

Until we fix this, here is a workaround: make the weight vector line up with the dfm element by element.

myDfm * rep(newweights, each = ndoc(myDfm))
## Document-feature matrix of: 3 documents, 5 features.
## 3 x 5 sparse Matrix of class "dfmSparse"
##        features
## docs    apple better banana much new
##   text1     5      1      3  0     0
##   text2     5      1      6  0.5   0
##   text3     0      1      3  1.0   1

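Update:

This is not a bug but a feature: it comes from how the vector newweights is recycled to conform to the matrix it is multiplied by. R recycles the vector in column-major order, so in this example it builds the following matrix (not the one you wanted) and multiplies by it element by element: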

matrix(rep(newweights, 3), nrow = 3)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5  0.5  1.0    1  3.0
## [2,]    1  1.0  3.0    5  0.5
## [3,]    3  5.0  0.5    1  1.0
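
The recycling can be checked without quanteda. The following is a minimal sketch that assumes the counts from the question are rebuilt as a plain base-R matrix (the names counts, recycled and misaligned are illustrative, not from the original post); it is not a statement about how the dfm class works internally.

```r
# Counts from the question's dfm, rebuilt as a plain base-R matrix
# (an assumption, so that the sketch runs without quanteda).
counts <- matrix(
  c(1, 1, 0,   # apple
    1, 1, 1,   # better
    1, 2, 1,   # banana
    0, 1, 2,   # much
    0, 0, 1),  # new
  nrow = 3,
  dimnames = list(docs = c("text1", "text2", "text3"),
                  features = c("apple", "better", "banana", "much", "new"))
)
newweights <- c(apple = 5, better = 1, banana = 3, much = 0.5, new = 1)

# R stores matrices column by column, so the weight vector is recycled
# down each column in turn rather than across each row.
recycled <- matrix(rep(newweights, length.out = length(counts)),
                   nrow = nrow(counts))

# The plain product equals multiplication by that recycled matrix,
# which is why the weights end up on the wrong cells.
misaligned <- counts * newweights
print(misaligned)
```

On a plain matrix this gives the same misaligned values that appear in the question's output.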

If you want to stick with your original strategy, this will work:

t(t(myDfm) * newweights)
## Document-feature matrix of: 3 documents, 5 features (26.7% sparse).
## 3 x 5 sparse Matrix of class "dfmSparse"
##        features
## docs    apple better banana much new
##   text1     5      1      3  0     0
##   text2     5      1      6  0.5   0
##   text3     0      1      3  1.0   1

because the recycling now runs over features rather than documents.
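
As a rough cross-check, the sketch below compares the transpose approach with two other ways of scaling columns, again on a plain base-R matrix rather than a dfm. The use of sweep() here is an addition for illustration; whether it behaves the same on dfm objects is an assumption that has not been verified, and the variable names are hypothetical.

```r
# Same counts and weights as in the question, as a plain base-R matrix
# (an assumption, so that the sketch runs without quanteda).
counts <- matrix(
  c(1, 1, 0, 1, 1, 1, 1, 2, 1, 0, 1, 2, 0, 0, 1),
  nrow = 3,
  dimnames = list(docs = c("text1", "text2", "text3"),
                  features = c("apple", "better", "banana", "much", "new"))
)
newweights <- c(apple = 5, better = 1, banana = 3, much = 0.5, new = 1)

# Transpose so features run down the rows, multiply, then transpose back.
by_transpose <- t(t(counts) * newweights)

# sweep() applies the weights along margin 2 (the columns) directly.
by_sweep <- sweep(counts, 2, newweights, "*")

# Repeating each weight once per document lines the vector up with
# the column-major storage of the matrix.
by_rep <- counts * rep(newweights, each = nrow(counts))

print(by_transpose)
```

All three results agree on a base matrix; the transpose form is the one shown in the answer to preserve the dfm class.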