Improving loop speed over a numpy.ndarray

Date: 2015-01-26 01:15:00

Tags: python performance python-2.7 numpy

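I am trying to compute the mutual information of unigrams in a dataset, and I would like to speed up the loops over numpy ndarrays that this involves. The code below takes an already-built matrix 'C' with 6018 rows and 27721 columns and computes the PMI matrix from it. Any ideas on how to speed up the for loops (a run currently takes almost 4 hours)? I have read in other posts about using Cython, but are there other options? Thanks in advance for your help.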

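import numpy as np
from math import log10

# MAKE MUTUAL INFO MATRIX, PMI
print "Creating mutual information matrix"
N = C.sum()
# replaced divide by N with multiply by invN in formula below;
# 1.0 forces float division (under Python 2, 1/N is 0 if C holds integers)
invN = 1.0 / N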
PMI = np.zeros((C.shape))
row, col = C.shape
for r in xrange(row):  # u
    for c in xrange(r):  # w
        if C[r,c]!=0:  # if they co-occur
            numerator = C[r,c]*invN  # getting number of reviews where u and w co-occur and multiply by invN (numerator)
            denominator = (sum(C[:,c])*invN) * (sum(C[r])*invN)
            pmi = log10(numerator*(1/denominator))
            PMI[r,c] = pmi
            PMI[c,r] = pmi

1 Answer:

Answer 0 (score: 1)

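If you can drop the loops and take advantage of NumPy's vectorization, you should see a significant speedup.

I have not tried it, but something like this should work: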

numerator = C * invN
denominator = (np.sum(C, axis=0) * invN) * (np.sum(C, axis=1)[:,None] * invN)
pmi = np.log10(numerator * (1 / denominator))

Note that numerator, denominator, and pmi are all arrays of values rather than scalars.
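To see why the denominator comes out as a full matrix, it may help to look at the shapes involved. The sketch below uses a small made-up 3x4 matrix (not the asker's data); the names col_sums, row_sums, and outer are introduced here for illustration only.

```python
import numpy as np

# Hypothetical 3x4 count matrix, used only for illustration.
C = np.array([[2, 0, 1, 1],
              [0, 3, 1, 0],
              [1, 1, 0, 4]], dtype=float)

col_sums = np.sum(C, axis=0)           # shape (4,): one total per column
row_sums = np.sum(C, axis=1)[:, None]  # shape (3, 1): one total per row, as a column

# Broadcasting (4,) against (3, 1) gives a (3, 4) array whose [r, c] entry
# is row_sums[r] * col_sums[c], i.e. the product the loop computes per cell.
outer = col_sums * row_sums
```

The `[:, None]` index is what turns the row totals into a column vector; without it, the two 1-D arrays would have mismatched lengths and could not be broadcast against each other.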

In addition, you will probably have to handle the C == 0 case somehow:

pmi = np.log10(numerator[numerator != 0] * (1 / denominator[numerator != 0]))
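One thing to keep in mind is that boolean indexing like this returns a flat 1-D array of the selected entries, so the matrix shape is lost. If the goal is a PMI matrix with zeros where the words never co-occur (as in the question's loop), one possible approach is to write the results back through the same mask. This is a sketch on a small made-up matrix; the names mask and PMI are illustrative.

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts, used only for illustration.
C = np.array([[0, 2, 1],
              [2, 0, 3],
              [1, 3, 0]], dtype=float)
invN = 1.0 / C.sum()

numerator = C * invN
denominator = (np.sum(C, axis=0) * invN) * (np.sum(C, axis=1)[:, None] * invN)

# Compute the logarithm only where the pair co-occurs, and leave 0 elsewhere.
mask = numerator != 0
PMI = np.zeros(C.shape)
PMI[mask] = np.log10(numerator[mask] / denominator[mask])
```

Because the logarithm is evaluated only on the masked entries, no log10(0) is ever computed and PMI keeps the shape of C.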

As Blckknght pointed out in the comments, you can omit some of the invN multiplications:

denominator = np.sum(C, axis=0) * np.sum(C, axis=1)[:,None] * invN
pmi = np.log10(C * (1 / denominator))
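Putting the pieces together, a minimal sketch of the whole computation might look like the following. The function names pmi_matrix and pmi_loop are hypothetical, and the input is a small made-up matrix; pmi_loop mirrors the question's double loop so the two results can be compared. Note that the vectorized version computes every cell, while the loop only visits c < r and mirrors the value, so the two are only expected to agree for a symmetric C with a zero diagonal.

```python
import numpy as np

def pmi_matrix(C):
    """Vectorized PMI for a count matrix C; cells with C == 0 stay 0."""
    C = np.asarray(C, dtype=float)  # float avoids integer division
    invN = 1.0 / C.sum()
    denominator = np.sum(C, axis=0) * np.sum(C, axis=1)[:, None] * invN
    mask = C != 0
    PMI = np.zeros(C.shape)
    PMI[mask] = np.log10(C[mask] / denominator[mask])
    return PMI

def pmi_loop(C):
    """Reference version that follows the question's double loop."""
    C = np.asarray(C, dtype=float)
    invN = 1.0 / C.sum()
    PMI = np.zeros(C.shape)
    for r in range(C.shape[0]):
        for c in range(r):
            if C[r, c] != 0:
                denominator = (C[:, c].sum() * invN) * (C[r].sum() * invN)
                PMI[r, c] = PMI[c, r] = np.log10(C[r, c] * invN / denominator)
    return PMI

# Hypothetical symmetric counts with a zero diagonal, for comparison only.
C = np.array([[0, 2, 1, 0],
              [2, 0, 3, 1],
              [1, 3, 0, 4],
              [0, 1, 4, 0]])

same = np.allclose(pmi_matrix(C), pmi_loop(C))
```

On a matrix this small the timing difference is negligible; the speedup from vectorization should only become apparent at sizes closer to the 6018 x 27721 matrix in the question.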