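I was sure that somewhere among the R packages text2vec, tm, quanteda, svs, qlcMatrix and wordspace there would be a function to calculate the PPMI (positive pointwise mutual information) between terms and contexts, based on a term-term (context) co-occurrence matrix. Apparently there isn't, so I went and wrote one myself. The problem is that it is as slow as molasses, probably because I am not very good with sparse matrices. My tcms are around 10k * 20k, so they really do need to be sparse.
As I understand it,
PMI = log( p(word, context) / (p(word) * p(context)) )
so I deduce that:
PMI = log( (count(word_context_co-occurrence) / N) / (count(word)/N * count(context)/N) )
where N is the sum of all co-occurrences in the co-occurrence matrix. PPMI then just forces all values < 0 to 0. (That is right, right?)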
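As a sanity check of the formula above, here is a minimal dense sketch on a made-up 2 x 2 count matrix. The counts and the word/context labels are invented purely for illustration, and the dense `outer()` approach is only meant for tiny inputs, not for a 10k * 20k tcm.

```r
# Hypothetical toy counts: rows are words, columns are contexts.
counts <- matrix(c(4, 0, 1, 5), nrow = 2,
                 dimnames = list(c("word1", "word2"), c("ctx1", "ctx2")))

N <- sum(counts)                    # total number of co-occurrences
wordp <- rowSums(counts) / N        # p(word)
contextp <- colSums(counts) / N     # p(context)

# PMI = log( p(word, context) / (p(word) * p(context)) )
pmi <- log((counts / N) / outer(wordp, contextp))

# PPMI: negative values (including -Inf from zero counts) are clamped to 0.
ppmi <- pmax(pmi, 0)
ppmi
```

Cells with a zero count give log(0) = -Inf, which the clamp turns into 0, so zero cells stay zero and only the non-zero cells ever need to be computed.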
With that in mind, here is an attempt at an implementation:
library(Matrix)
set.seed(1)
pmat = matrix(sample(c(0,0,0,0,0,0,1,10), 5*10, TRUE), 5, 10, byrow=TRUE) # tiny example matrix;
# rows are words, columns are contexts (words the row-words co-occur with, in a certain window in the text)
pmat = Matrix(pmat, sparse=TRUE) # make it sparse
# calculate some things beforehand to make it faster
N = sum(pmat)
contextp = Matrix::colSums(pmat)/N # probabilities of contexts
wordp = Matrix::rowSums(pmat)/N # probabilities of terms
# here goes nothing...
pmat2 = pmat
for(r in 1:nrow(pmat)){ # go term by term, calculate PPMI association with each of its contexts
  not0 = which(pmat[r, ] > 0) # no need to consider 0 values (no co-occurrence)
  tmp = log( (pmat[r,not0] / N) / (wordp[r] * contextp[not0] )) # PMI
  tmp = ifelse(tmp < 0, 0, tmp) # PPMI
  pmat2[r, not0] = tmp # <-- THIS here is the slow part, replacing the old frequency values with the new PPMI weighted ones.
}
# take a look:
round(pmat2,2)
What seems to be slow is not the calculation itself, but putting the newly calculated values into the sparse matrix. (On this tiny example it is not bad, but with thousands of rows by thousands of columns even a single iteration of this loop takes forever; building a new matrix instead seems like an even worse idea.)
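What would be a more efficient way of replacing the old values in such a sparse matrix with the new PPMI-weighted ones? Suggestions for changing this code, or a pointer to an existing function in some package that I have somehow missed, are both welcome.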
Answer 0 (score: 0)
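I figured it out in the meantime, and this approach is reasonably fast. I will leave it here in case anybody else runs into the same problem. It also seems to be along the same lines as the approach linked in the comments (thanks!).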
# this is for a column-oriented sparse matrix; transpose if necessary
tcmrs = Matrix::rowSums(pmat)
tcmcs = Matrix::colSums(pmat)
N = sum(tcmrs)
colp = tcmcs/N
rowp = tcmrs/N
pp = pmat@p+1
ip = pmat@i+1
tmpx = rep(0,length(pmat@x)) # new values go here, just a numeric vector
# iterate through sparse matrix:
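for(i in seq_len(ncol(pmat)) ){ # one iteration per column
  if(pp[i+1] == pp[i]) next # empty column: pp[i]:(pp[i+1]-1) would count backwards and touch other columns' entries
  ind = pp[i]:(pp[i+1]-1) # positions in @x / @i that belong to column i
  not0 = ip[ind] # row indices of the non-zero entries in column i
  icol = pmat@x[ind] # their co-occurrence counts
  tmp = log( (icol/N) / (rowp[not0] * colp[i] )) # PMI
  tmpx[ind] = tmp
}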
pmat@x = tmpx
# to convert to PPMI, replace <0 values with 0 and do a Matrix::drop0() on the object.
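The column loop above can probably be collapsed into a single vectorized pass over the non-zero entries. The following is only a sketch under stated assumptions: the input is a `dgCMatrix` (what `Matrix(..., sparse=TRUE)` returns for a general numeric matrix), and `ppmi_sparse` is a hypothetical helper name, not a function from any of the packages mentioned in the question.

```r
library(Matrix)

# Hypothetical helper: PPMI-weight a column-compressed sparse count matrix.
# Assumes `m` is a dgCMatrix with rows = words and columns = contexts.
ppmi_sparse <- function(m) {
  stopifnot(is(m, "dgCMatrix"))
  m <- drop0(m)                        # remove explicitly stored zeros, if any
  N <- sum(m@x)                        # total number of co-occurrences
  rowp <- Matrix::rowSums(m) / N       # p(word)
  colp <- Matrix::colSums(m) / N       # p(context)
  rows <- m@i + 1L                     # 1-based row index of every stored entry
  # diff(m@p) is the number of stored entries per column, so this repeats
  # each column index once per stored entry in that column.
  cols <- rep(seq_len(ncol(m)), diff(m@p))
  pmi <- log((m@x / N) / (rowp[rows] * colp[cols]))
  m@x <- pmax(pmi, 0)                  # PPMI: clamp negative PMI to 0
  drop0(m)                             # drop the entries that became 0
}

# Usage on the tiny example from the question:
set.seed(1)
dense <- matrix(sample(c(0,0,0,0,0,0,1,10), 5*10, TRUE), 5, 10, byrow = TRUE)
pmat <- Matrix(dense, sparse = TRUE)
round(ppmi_sparse(pmat), 2)
```

Because only the `@x` slot is rewritten, no per-element assignment into the sparse matrix takes place, which is the step the question identified as slow. Empty columns need no special handling here, since they contribute zero repetitions to `cols`.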