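I am trying to compute the pointwise mutual information (PMI) of the unigrams in a dataset, and in doing so I want to speed up a loop over NumPy ndarrays. In the code below I use an already-built matrix `C` with 6018 rows and 27721 columns to compute the PMI matrix. Any ideas on how to speed up the for loops (a run currently takes almost 4 hours)? I have read in other posts about using Cython, but are there any other options? Thanks in advance for your help.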
import numpy as np
from math import log10

# MAKE MUTUAL INFO MATRIX, PMI
print "Creating mutual information matrix"
N = C.sum()
invN = 1.0 / N  # float division; the formula below multiplies by invN instead of dividing by N
PMI = np.zeros(C.shape)
row, col = C.shape
for r in xrange(row):  # u
    for c in xrange(r):  # w
        if C[r, c] != 0:  # if they co-occur
            # number of reviews where u and w co-occur, multiplied by invN
            numerator = C[r, c] * invN
            denominator = (sum(C[:, c]) * invN) * (sum(C[r]) * invN)
            pmi = log10(numerator * (1 / denominator))
            PMI[r, c] = pmi
            PMI[c, r] = pmi
Answer 0 (score: 1)
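If you can drop the loops and take advantage of NumPy's vectorization, you should see a substantial speedup.
I haven't tested this, but something like the following should work: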
numerator = C * invN
denominator = (np.sum(C, axis=0) * invN) * (np.sum(C, axis=1)[:,None] * invN)
pmi = np.log10(numerator * (1 / denominator))
Note that numerator, denominator and pmi are all arrays of values rather than scalars.
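To make the shapes concrete, here is a small sketch on a made-up 2x3 count matrix (the data and the names `row_sums`, `col_sums` and `outer` are illustrative, not from the question). It shows how the `[:, None]` trick turns the row sums into a column so that broadcasting produces one value per cell of `C`:

```python
import numpy as np

# Toy co-occurrence counts (hypothetical data): 2 rows, 3 columns.
C = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 1.0]])

col_sums = np.sum(C, axis=0)            # shape (3,): one total per column
row_sums = np.sum(C, axis=1)[:, None]   # shape (2, 1): one total per row, as a column

# Broadcasting (2, 1) against (3,) gives a (2, 3) array:
# outer[r, c] == row_sums[r] * col_sums[c] for every cell at once.
outer = row_sums * col_sums
print(outer.shape)
```

Because the sums are computed once per row and once per column, this also avoids recomputing `sum(C[:, c])` and `sum(C[r])` for every cell, which is probably where most of the loop's time goes.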
Also, you will probably have to handle the C == 0 case somehow:
pmi = np.log10(numerator[numerator != 0] * (1 / denominator[numerator != 0]))
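One thing to watch for: indexing with a boolean mask like this returns a flattened 1-D array, so the row and column positions are lost. If you want a result shaped like `C` (as the `PMI` matrix in the question is), one possible approach is to write the values back at the nonzero positions. The sketch below is untested on the real data and assumes `C` holds non-negative counts, so that any nonzero cell has nonzero row and column sums; the function name `pmi_keep_shape` is made up for illustration:

```python
import numpy as np

def pmi_keep_shape(C):
    """Return an array shaped like C holding log10 PMI where C != 0, and 0 elsewhere."""
    C = np.asarray(C, dtype=float)
    N = C.sum()
    row_sums = C.sum(axis=1)   # total count per row
    col_sums = C.sum(axis=0)   # total count per column

    # Index arrays of the cells where the two items co-occur.
    r, c = np.nonzero(C)

    pmi = np.zeros(C.shape)
    # (C/N) / ((row/N) * (col/N)) simplifies to C * N / (row * col).
    pmi[r, c] = np.log10(C[r, c] * N / (row_sums[r] * col_sums[c]))
    return pmi
```

Only the nonzero cells are passed to `np.log10`, so no full-size temporary for the denominator is needed and no log-of-zero warnings are raised.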
As Blckknght pointed out in the comments, you can omit some of the invN multiplications:
denominator = np.sum(C, axis=0) * np.sum(C, axis=1)[:,None] * invN
pmi = np.log10(C * (1 / denominator))
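The simplification works because one factor of invN in the numerator cancels one of the two in the denominator. As a quick sanity check, the following sketch compares both forms on a made-up count matrix (the toy data and the `_full` / `_short` names are illustrative), restricted to the nonzero cells so that no logarithm of zero is taken:

```python
import numpy as np

# Toy co-occurrence counts (hypothetical data).
C = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 1.0]])
invN = 1.0 / C.sum()
mask = C != 0   # only compare cells where the items co-occur

# Form with invN applied to the numerator and to both marginals.
numerator = C * invN
denominator_full = (np.sum(C, axis=0) * invN) * (np.sum(C, axis=1)[:, None] * invN)
pmi_full = np.log10(numerator[mask] / denominator_full[mask])

# Simplified form with a single invN factor.
denominator_short = np.sum(C, axis=0) * np.sum(C, axis=1)[:, None] * invN
pmi_short = np.log10(C[mask] / denominator_short[mask])

same = np.allclose(pmi_full, pmi_short)
print(same)
```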