Question

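I'm trying to write a function in Python (still a noob!) that returns the indices and scores of documents ordered by the inner product of their tfidf scores. The procedure is:

Compute the vector of inner products between doc idx and all other documents
Sort in descending order
Return the "scores" and indices from the second one to the end (i.e. not itself)

My current code is: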

import h5py
import numpy as np

def get_related(tfidf, idx) :
    ''' return the top documents '''

    # calculate inner product   
    v = np.inner(tfidf, tfidf[idx].transpose())

    # sort
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:,]

    # sort indices
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:,] 

    return (scores, idxs)

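where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.

This seems inefficient, since the sort is performed twice (sort() then argsort()) and the results then have to be reversed.

Can this be done more efficiently?
Can it be done without using toarray()?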

Answer 1

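I don't think there is any real need to skip toarray. The v array is only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. It will also be quite dense, since any term shared by two documents gives them a non-zero similarity. Sparse matrix representations only pay off when the matrix you are storing is very sparse (I have seen a figure of 80% quoted for Matlab and assume SciPy will be similar, though I don't have exact figures).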
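To make the density argument concrete, here is a small sketch that compares the fraction of stored entries in a tf-idf matrix with that of the similarity column. The matrix values and the `density` helper are made up for illustration, and `tfidf` is assumed to be a `scipy.sparse.csr_matrix`.

```python
import numpy as np
from scipy import sparse

def density(m):
    """Fraction of entries of a sparse matrix that are explicitly stored."""
    rows, cols = m.shape
    return m.nnz / (rows * cols)

# Toy 4-document, 6-term matrix (hypothetical values, for illustration only).
tfidf = sparse.csr_matrix(np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
]))

# Similarity of every document with document 0: an (n_docs, 1) sparse column.
v = tfidf.dot(tfidf[0, :].transpose())

print("tf-idf density:", density(tfidf))
print("similarity column density:", density(v))
```

Documents 1 and 2 each share one term with document 0, so three of the four entries in v are non-zero even though only a third of the tf-idf matrix is.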

The double sort can be skipped by doing

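v = v.toarray().ravel()      # dense 1-D array of length n_docs
vi = np.argsort(v)[::-1]     # indices in descending order of score
vs = v[vi]                   # scores in the same order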
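Putting the pieces together, a minimal sketch of the whole function might look like the following. It assumes tfidf is a `scipy.sparse.csr_matrix` with one row per document and uses the sparse `.dot` method for the product. Filtering out idx explicitly, rather than dropping the first sorted entry, is my own choice and not part of the original code: it avoids relying on a document always being most similar to itself.

```python
import numpy as np
from scipy import sparse

def get_related(tfidf, idx):
    """Return (scores, indices) of the other documents, most similar first."""
    # Inner product of every document with document idx: (n_docs, 1) sparse column.
    v = tfidf.dot(tfidf[idx, :].transpose())
    # Densify and flatten to a 1-D array of length n_docs.
    v = v.toarray().ravel()
    # A single argsort, reversed to get descending order.
    order = np.argsort(v)[::-1]
    # Remove the query document itself.
    order = order[order != idx]
    return v[order], order

# Usage with a toy matrix (hypothetical values).
tfidf = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 3.0, 1.0],
]))
scores, idxs = get_related(tfidf, 0)
print(idxs, scores)
```

`np.argsort(-v)` would also give a descending order without the reversal, though ties may come out in a different order.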

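By the way, your use of np.inner on sparse matrices does not work with recent versions of NumPy; the safe way to take the inner product of two sparse matrices is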

v = tfidf * tfidf[idx, :].transpose()
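As a quick sanity check on the shapes, the sketch below computes the product on a toy matrix and compares it with a dense reference. It assumes a `scipy.sparse.csr_matrix` and uses `.dot`, which performs a matrix product for sparse matrices; the data is made up.

```python
import numpy as np
from scipy import sparse

# Toy 3-document, 3-term matrix (hypothetical values).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])
tfidf = sparse.csr_matrix(dense)
idx = 0

# Sparse route: (n_docs, n_terms) times (n_terms, 1) gives an (n_docs, 1) column.
v_sparse = tfidf.dot(tfidf[idx, :].transpose())

# Dense reference: inner product of every row with row idx.
v_dense = dense @ dense[idx]

print(v_sparse.shape)
print(np.allclose(v_sparse.toarray().ravel(), v_dense))
```

The row vector has to be transposed before multiplying; otherwise the inner dimensions (n_terms and 1) do not match.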
