Question

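I want to find the Hellinger distance between a single distribution p and each row of a sparse matrix dist_mat. The result should be a vector of dimension 1 * N, where N is the number of rows in dist_mat.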

import numpy as np
from scipy.sparse import csr_matrix

def hellinger(p, dist_mat):
    return np.sqrt(1/2) * np.sqrt(  np.sum((np.sqrt(p) - np.sqrt(dist_mat))**2)  )

Using the function above on a test case:

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csr_matrix((data, (row, col)), shape=(3, 3)).toarray()
test = np.array([0,21,0])
hellinger(test,csr_matrix((data, (row, col)), shape=(3, 3)))
>>> 4.3633103660024926

This returns a scalar, not a vector. For the example above, I would like a list of results containing the Hellinger distances. Something like:

hellinger(test,csr_matrix((data, (row, col)), shape=(3, 3)))
>>> [3.46,3.46,2.78] # hellinger distance between test and each row of the csr sparse matrix

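Is there a way to return the desired vector of distances using NumPy notation, perhaps with np.apply_along_axis? I have seen this done before but can't seem to get it to work here. Thanks in advance.

Note: I want to avoid explicit for loops, since they are inefficient. I am looking for the most optimized / fastest way to do this.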

Answer 1

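Vectorized solution

Here is the final vectorized solution I arrived at after a few optimizations and one key trick, assuming s is the input sparse matrix of type csr_matrix.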

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)  # square roots are taken element-wise, before the product
out = k1*np.sqrt(test.sum() + s.sum(1).A1 - 2*np.sqrt(s)*k2)
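As a quick sanity check, here is a minimal self-contained sketch that applies this kind of vectorized expression to the test case from the question. The names row_sums and cross are my own, and writing the square root as s.sqrt() and the product as .dot() is a choice made for explicitness; s is assumed to be a csr_matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Test case from the question.
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
s = csr_matrix((data, (row, col)), shape=(3, 3))
test = np.array([0, 21, 0])

k1 = np.sqrt(1 / 2)
k2 = np.sqrt(test)              # element-wise sqrt of the dense vector
row_sums = s.sum(axis=1).A1     # per-row sums as a flat ndarray
cross = s.sqrt().dot(k2)        # sum_j sqrt(s[i, j]) * sqrt(test[j]) for every row i
out = k1 * np.sqrt(test.sum() + row_sums - 2 * cross)

print(np.round(out, 2))  # [3.46 3.46 2.78]
```

The result has one entry per row of s, which is the vector the question asks for.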

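Playback history

The final vectorized solution came out of a series of optimizations, which I will play back here for my own and others' reference. I apologize for the length, but I feel it is necessary.

Stage #1

First, inline the function definition into a loop: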

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] = np.sqrt(1/2) * np.sqrt(  np.sum((np.sqrt(test) - np.sqrt(ai))**2)  )

Stage #2

Pull out the constants and perform the square root outside the loop:

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] = np.sum((k2 - np.sqrt(ai))**2)

out = np.sqrt(out)
out *= k1

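Stage #3 (key trick)

Here comes the key trick, which uses the algebraic identity:

(A-B)**2 = A**2 + B**2 - 2*A*B

Thus,

sum((A-B)**2) = sum(A**2) + sum(B**2) - 2*sum(A*B)

The last part, sum(A*B), is just a matrix multiplication, and that is the main performance booster.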
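The identity is easy to confirm numerically. The sketch below uses made-up dense data (M, B, and the seed are illustrative assumptions, not part of the answer) and shows that, with many rows, the cross term becomes a single matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary illustrative data
M = rng.random((4, 5))              # each row plays the role of A
B = rng.random(5)

# Left-hand side: squared differences summed per row.
direct = np.sum((M - B) ** 2, axis=1)

# Right-hand side: two plain sums plus one matrix-vector product.
expanded = (M ** 2).sum(axis=1) + (B ** 2).sum() - 2 * M.dot(B)

print(np.allclose(direct, expanded))  # True
```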

This simplifies to:

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] = (k2**2).sum() + (np.sqrt(ai)**2).sum() - 2*np.sqrt(ai).dot(k2)

out = np.sqrt(out)
out *= k1

Which simplifies further to:

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] = (k2**2).sum() + ai.sum() -2*np.sqrt(ai).dot(k2)

out = np.sqrt(out)
out *= k1

Stage #4

Pull out the constant (k2**2).sum() and compute the row-wise sums of the sparse matrix outside the loop:

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)
k2s = (k2**2).sum()

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] =  -2*np.sqrt(ai).dot(k2)

out += k2s + s.sum(1).A1 # row-wise summation of sparse matrix added here
out = np.sqrt(out)
out *= k1

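Stage #5

The last trick is to remove the loop entirely. Inside the loop, each output element is computed with np.sqrt(s[i]).dot(k2). That matrix multiplication can be done for all rows at once with simply np.sqrt(s)*k2. That's all!

What remains is:

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)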
k2s = (k2**2).sum()

out = -2*np.sqrt(s)*k2 # Loop gone here
out += k2s + s.sum(1).A1
out = np.sqrt(out)
out *= k1
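To see that one sparse product reproduces the per-row dot products from the loop, here is a small check on made-up data. The matrix, the vector, and the names looped and vectorized are illustrative assumptions, and the square root is written as s.sqrt() for explicitness.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)      # arbitrary illustrative data
dense = rng.random((6, 4))
dense[dense < 0.5] = 0.0            # zero out some entries to make it sparse
s = csr_matrix(dense)
k2 = np.sqrt(rng.random(4))

# Per-row dot products, as in the loop version.
looped = np.array([np.sqrt(s[i].toarray().ravel()).dot(k2)
                   for i in range(s.shape[0])])

# One sparse matrix-vector product covering all rows.
vectorized = s.sqrt().dot(k2)

print(np.allclose(looped, vectorized))  # True
```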

Simplifying further, using an inner dot product to get k2s:

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)
k2s = k2.dot(k2)
out = k1*np.sqrt(k2s + s.sum(1).A1 -2*np.sqrt(s)*k2)

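Since k2.dot(k2) is just test.sum(), k2s can be read directly from test. The element-wise square roots in the cross term have to stay, though: np.sqrt(s*test) agrees with np.sqrt(s)*k2 only when test has a single nonzero entry, as in the test case above. That leaves:

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)
out = k1*np.sqrt(test.sum() + s.sum(1).A1 - 2*np.sqrt(s)*k2)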

Answer 2

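Building on @Divakar's answer...

If the distributions $P$ and $Q$ that we feed into the Hellinger function are normalized (so $\sum_i P_i = 1$), as they are in my case, then the function simplifies to

$\sqrt{1 - \sum_{i=1}^{k} \sqrt{P_i Q_i}}$

This works even when $Q$ is a matrix whose rows are distribution vectors.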
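For reference, the simplification follows from expanding the square in the general definition and using $\sum_i P_i = \sum_i Q_i = 1$. The symbol $H(P, Q)$ for the distance is shorthand introduced here, not notation from the answer.

$$
\begin{aligned}
H(P, Q) &= \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{P_i} - \sqrt{Q_i}\right)^2} \\
        &= \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} P_i + \sum_{i=1}^{k} Q_i - 2 \sum_{i=1}^{k} \sqrt{P_i Q_i}} \\
        &= \frac{1}{\sqrt{2}} \sqrt{2 - 2 \sum_{i=1}^{k} \sqrt{P_i Q_i}} \\
        &= \sqrt{1 - \sum_{i=1}^{k} \sqrt{P_i Q_i}}
\end{aligned}
$$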

So we can write

def hellinger(p, dist_mat):
    # take element-wise square roots first, then the row-wise dot product
    out = np.sqrt(1 - (dist_mat.sqrt()*p.sqrt().T).toarray())
    return out.T[0]

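A few points

I use .toarray() because I cannot subtract a sparse vector from a scalar; otherwise we get the error NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
I need to take the dot product with p.T, the transpose of p, so that the dimensions match
I return the first element of the transposed out, which gives the required vector
Works for sparse inputs p and dist_mat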
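The points above can be put together in a small runnable sketch. The function name hellinger_normalized, the sample matrices, and the clipping step are my own assumptions rather than part of the answer; the sketch assumes p is a 1 x K sparse row vector and that p and every row of dist_mat sum to 1.

```python
import numpy as np
from scipy.sparse import csr_matrix

def hellinger_normalized(p, dist_mat):
    """Hellinger distance between sparse p (1 x K) and each row of dist_mat (N x K).

    Assumes p and every row of dist_mat are normalized to sum to 1.
    """
    # Bhattacharyya coefficient per row: sum_j sqrt(Q[i, j]) * sqrt(p[j]).
    bc = dist_mat.sqrt().dot(p.sqrt().T).toarray().ravel()
    # Rounding can push bc slightly above 1; clip so the square root stays real.
    return np.sqrt(np.clip(1.0 - bc, 0.0, None))

# Made-up normalized distributions for illustration.
Q = csr_matrix(np.array([[0.5, 0.5, 0.0],
                         [0.0, 0.0, 1.0],
                         [0.25, 0.25, 0.5]]))
p = csr_matrix(np.array([[0.5, 0.5, 0.0]]))

result = hellinger_normalized(p, Q)
print(result)  # roughly [0, 1, 0.54]
```

The first row equals p, so its distance is 0; the second row shares no support with p, so its distance is 1, the maximum.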

Vectorizing Hellinger for sparse matrices - NumPy / Python
