Question

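I want to speed up part #2 of this code as much as possible, so I thought trying Cython might help. However, I'm not sure how to work with sparse matrices in Cython. Can someone show how to wrap this in Cython, or perhaps Julia, to make it faster?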

#1) This part builds the u_dict dictionary, which maps each unique string to an integer index.

import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix

full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() + train2.values.ravel().tolist() + test2.values.ravel().tolist())
print(len(full_dict))
u_dict= dict()
for i, q in enumerate(full_dict):
    u_dict[q] = i


shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)


def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

#2) I need to speed up this part
# train_full is a pandas DataFrame with two columns, w1 and w2, filled with strings

H = load_sparse_csr('matrix.npz')

correlation_train = []
for idx, row in train_full.iterrows():
    if idx % 1000 == 0: print(idx)
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    a_vec = H[id_1].toarray() # these vectors are of length of < 3 mil.
    b_vec = H[id_2].toarray()
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])

Answer 1

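While I contributed to How to properly pass a scipy.sparse CSR matrix to a cython function? quite some time ago, I doubt that Cython is the way to go, especially if you don't already have experience with numpy and Cython. Cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling back into numpy or other Python code. Throw pandas into the mix and you have an even steeper learning curve.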

Important parts of the sparse code are already written in compiled code (Cython and C++).

Without touching the Cython question, I see a couple of problems.

H is defined twice:

H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')

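That's either an oversight or a misunderstanding of how Python variables are created and assigned. The second assignment replaces the first, so the first does nothing. In addition, the first just makes an empty lil matrix. Such a matrix can be filled iteratively; while not fast, that is the intended use of the lil format.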
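As a minimal sketch of that intended use, the following fills a small lil matrix element by element and converts it to csr once at the end. The shape and the `pairs` list are hypothetical stand-ins for illustration; they are not taken from the question.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical small shape; the question's matrix is far larger.
shape = (4, 4)
H = sp.lil_matrix(shape, dtype=np.int8)

# Element-wise assignment is what the lil format is designed for.
pairs = [(0, 1), (1, 2), (3, 0)]  # made-up (row, column) positions
for i, j in pairs:
    H[i, j] = 1

# Convert once, after filling, for faster row access and arithmetic.
H_csr = H.tocsr()
```

Doing the conversion a single time after the loop avoids repeatedly restructuring the csr arrays, which is the expensive operation that lil exists to sidestep.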

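The second expression creates a new matrix from data saved in an npz file. That involves numpy's npz file loader as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for Cython to touch.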
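To make the round trip concrete, here is a sketch that pairs the question's `load_sparse_csr` with a matching `save_sparse_csr`. The save function is an assumption about how 'matrix.npz' was written (the question does not show it), and the small matrix `M` and the temporary path are made up for the demonstration.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, matrix):
    # Assumed counterpart to load_sparse_csr: store the three csr arrays and the shape.
    np.savez(filename, data=matrix.data, indices=matrix.indices,
             indptr=matrix.indptr, shape=matrix.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    # The arrays are already in csr layout, so construction is a thin wrapper.
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=tuple(loader['shape']))

M = csr_matrix(np.array([[0, 1, 0], [2, 0, 3]], dtype=np.int8))
path = os.path.join(tempfile.mkdtemp(), 'matrix.npz')
save_sparse_csr(path, M)
H = load_sparse_csr(path)
```

Loading amounts to reading three arrays and handing them to the constructor, so there is no Python-level loop here for Cython to replace.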

You do have an iteration here, but it is over a pandas DataFrame:

for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    a_vec = H[id_1].toarray()

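It looks like you are selecting a particular row of H based on a dictionary lookup. Sparse matrix indexing is slow compared to dense array indexing. That is, if Ha = H.toarray() fits in memory, then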

a_vec = Ha[id_1,:]

is going to be a lot faster.
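A small sketch of the dense variant follows, using a made-up 4x6 matrix in place of the real H. It only applies when the dense copy fits in memory: if H really is close to 3 million by 3 million, an int8 dense array would need roughly 9 TB, so treat this as an illustration of the indexing difference rather than a drop-in fix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up data standing in for the real H.
dense = np.array([[1, 0, 0, 1, 0, 1],
                  [0, 1, 0, 1, 1, 0],
                  [1, 1, 0, 0, 0, 1],
                  [0, 0, 1, 1, 0, 1]], dtype=np.int8)
H = csr_matrix(dense)

Ha = H.toarray()  # one-time conversion; only feasible if it fits in memory

id_1, id_2 = 1, 3  # hypothetical row ids, as u_dict would supply
a_vec = Ha[id_1, :]  # plain numpy row indexing, no sparse machinery
b_vec = Ha[id_2, :]
corr_dense = np.corrcoef(a_vec, b_vec)[0, 1]

# Same quantity computed the way the question does, for comparison.
corr_sparse = np.corrcoef(H[id_1].toarray(), H[id_2].toarray())[0, 1]
```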

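Faster selection of rows (or columns) from a sparse matrix has been asked about before. If you could work directly with the sparse data of a row, I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.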
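One possible way to work with the sparse rows directly, offered as an alternative sketch and not as something this answer prescribes, is to compute the Pearson coefficient from sums and elementwise products so that no dense vector is built. The helper name `sparse_row_corr` and the test matrix are hypothetical; the formula is the standard r = (n Σab − Σa Σb) / sqrt((n Σa² − (Σa)²)(n Σb² − (Σb)²)).

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_row_corr(H, i, j):
    """Pearson correlation of rows i and j of a CSR matrix, without toarray()."""
    n = H.shape[1]
    # Cast to float so int8 products and sums cannot overflow.
    a = H[i].astype(np.float64)
    b = H[j].astype(np.float64)
    sum_a, sum_b = a.sum(), b.sum()
    sum_ab = a.multiply(b).sum()  # elementwise product only touches stored entries
    sum_aa = a.multiply(a).sum()
    sum_bb = b.multiply(b).sum()
    cov = n * sum_ab - sum_a * sum_b
    var_a = n * sum_aa - sum_a ** 2
    var_b = n * sum_bb - sum_b ** 2
    # A constant row gives zero variance and a nan result, as with np.corrcoef.
    return cov / np.sqrt(var_a * var_b)

# Made-up data for a quick comparison against np.corrcoef.
dense = np.array([[1, 0, 0, 1, 0, 1],
                  [0, 1, 0, 1, 1, 0],
                  [1, 1, 0, 0, 0, 1],
                  [0, 0, 1, 1, 0, 1]], dtype=np.int8)
H = csr_matrix(dense)
r_sparse = sparse_row_corr(H, 1, 3)
r_dense = np.corrcoef(dense[1], dense[3])[0, 1]
```

The work per pair scales with the number of stored entries in the two rows instead of the full row length, which matters when rows have millions of columns but few nonzeros. The row selection H[i] itself is still sparse indexing, so this removes the toarray cost but not the lookup cost.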

How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

Is it possible to convert this Python code to Cython?
