Question

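I have a large-ish array, artist_topic_probs (about 112K item rows by ~100 feature columns), and I want to calculate pairwise cosine similarities between (large samples of) random pairs of rows from this array. Here are the relevant bits of my current code: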

# the number of random pairs to check (10 million here)
random_sample_size=10000000

# I want to make sure they're unique, and that I'm never comparing a row to itself
# so I generate my set of comparisons like so:
np.random.seed(99)
comps = set()
while len(comps)<random_sample_size:
    a = np.random.randint(0,112312)
    b= np.random.randint(0,112312)
    if a!=b:
        comp = tuple(sorted([a,b]))
        comps.add(comp)
# convert to list at the end to ensure sort order 
# not positive if this is needed...I've seen conflicting opinions
comps = list(sorted(comps))

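This produces a list of tuples, where each tuple holds the two rows between which I'll compute similarity. I then just use a simple loop to calculate all the similarities: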

c_dists = []
from scipy.spatial.distance import cosine
for a,b in comps:
    c_dists.append(cosine(artist_topic_probs[a],artist_topic_probs[b]))

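(Of course, cosine here gives a distance rather than a similarity, but we can easily get the similarity with sim = 1.0 - dist. I used "similarity" in the title because it's the more common term.)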

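This works fine, but it isn't very fast, and I need to repeat the process many times. I have 32 cores to work with, so parallelization seems like a good bet, but I'm not sure of the best way to go about it. My idea is something like this: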

pool = mp.Pool(processes=32)
c_dists = [pool.apply(cosine, args=(artist_topic_probs[a],artist_topic_probs[b])) 
    for a,b in comps]

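But testing this approach on my laptop with some test data didn't work (it just hung, or at least took so much longer than the simple loop that I got tired of waiting and killed it). My concern is that indexing into the matrix is some kind of bottleneck, but I'm not sure. Any ideas on how to parallelize this effectively (or otherwise speed up the process)?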

Answer 1

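First, in the future you might want to use itertools.combinations and random.sample to get unique pairs, but that isn't feasible in this case because of memory constraints. Second, multiprocessing is not multithreading: spawning a new process involves substantial system overhead, so there is little point in spawning a process for every individual task. A task has to be well worth the overhead of starting a new process, so you are better off splitting all the work into separate jobs (as many as the number of cores you want to use). Finally, don't forget that the multiprocessing implementation serializes the whole namespace and loads it into memory N times, where N is the number of processes. If you don't have enough RAM to store N copies of your large array, this can lead to heavy swapping, so you might want to reduce the number of cores.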
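As a rough illustration of the itertools.combinations plus random.sample idea mentioned above, here is a minimal sketch. The helper name sample_unique_pairs and the toy sizes are assumptions made for the example, not part of the original code. It works for small arrays only, because it materializes every candidate pair first.

```python
import itertools
import random


def sample_unique_pairs(n_rows, n_pairs, seed=99):
    """Draw n_pairs distinct unordered row pairs (a < b).

    Hypothetical helper: it builds the full list of candidate pairs,
    which needs n_rows * (n_rows - 1) / 2 tuples in memory.
    """
    all_pairs = list(itertools.combinations(range(n_rows), 2))
    rng = random.Random(seed)
    # random.sample draws without replacement, so no pair repeats
    return sorted(rng.sample(all_pairs, n_pairs))


# Fine for a small array ...
pairs = sample_unique_pairs(n_rows=20, n_pairs=5)
print(pairs)

# ... but the number of candidate pairs grows quadratically with the row count.
print(112312 * 112311 // 2)
```

For 112312 rows that is over six billion candidate pairs, which is why the set-based rejection loop in the question is the more practical choice here.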

Updated to restore the initial order, as you requested.

I made a test dataset of identical vectors, so cosine must return zero for every pair.

from __future__ import division, print_function
import math
import multiprocessing as mp
from scipy.spatial.distance import cosine
from operator import itemgetter
import itertools


def worker(enumerated_comps):
    return [(ind, cosine(artist_topic_probs[a], artist_topic_probs[b])) for ind, (a, b) in enumerated_comps]


def slice_iterable(iterable, chunk):
    """
    Slices an iterable into chunks of at most `chunk` items
    :param chunk: the number of items per slice
    :type chunk: int
    :type iterable: collections.Iterable
    :rtype: collections.Generator
    """
    _it = iter(iterable)
    return itertools.takewhile(
        bool, (tuple(itertools.islice(_it, chunk)) for _ in itertools.count(0))
    )


# Test data
artist_topic_probs = [list(range(10)) for _ in range(10)]
comps = tuple(enumerate([(1, 2), (1, 3), (1, 4), (1, 5)]))

n_cores = 2
chunksize = int(math.ceil(len(comps)/n_cores))
jobs = tuple(slice_iterable(comps, chunksize))

pool = mp.Pool(processes=n_cores)
work_res = pool.map_async(worker, jobs)
# sort by the original index to restore the input order, then keep only the distances
c_dists = list(map(itemgetter(1), sorted(itertools.chain(*work_res.get()))))
print(c_dists)

Output:

[2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16]

These values are very close to zero.

P.S.

From the multiprocessing.Pool.apply documentation:

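Equivalent of the apply() built-in function. It blocks until the result is ready, so apply_async() is better suited for performing work in parallel. Additionally, func is only executed in one of the workers of the pool.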
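To make the difference concrete, here is a minimal sketch of submitting one job per chunk with apply_async. The names square_chunk, split and run_parallel are hypothetical stand-ins for illustration; in the real problem the per-chunk function would be something like the worker shown earlier.

```python
import multiprocessing as mp


def square_chunk(chunk):
    """Hypothetical stand-in for the real per-chunk work."""
    return [x * x for x in chunk]


def split(items, n_chunks):
    """Split a list into at most n_chunks contiguous, nearly equal pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_parallel(items, n_procs=2):
    chunks = split(items, n_procs)
    with mp.Pool(processes=n_procs) as pool:
        # apply_async returns an AsyncResult immediately,
        # so every chunk is submitted before we wait on any of them
        pending = [pool.apply_async(square_chunk, args=(c,)) for c in chunks]
        # get() blocks; reading results in submission order keeps the input order
        return [y for res in pending for y in res.get()]


if __name__ == "__main__":
    print(run_parallel(list(range(10))))
```

With pool.apply in the list comprehension instead, each call would wait for its result before the next one is submitted, so only one worker would be busy at a time.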

Answer 2

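As you can see at the link, scipy.spatial.distance.cosine introduces significant overhead into your computation, because every call recomputes the norms of both vectors it is given. For a sample of your size (10 million pairs) that amounts to 20 million norm computations. If you compute the norms of your ~100K vectors once, in advance, you can save about 60% of the computation time: each pair needs one dot product, u * v, and two norm calculations, and each of these three operations costs roughly the same number of operations.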
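A minimal sketch of the norm-caching idea, assuming the data fits in a NumPy array with no all-zero rows. The function name cosine_distances_for_pairs and the random stand-in data are illustrative assumptions, not part of the original post.

```python
import numpy as np


def cosine_distances_for_pairs(X, pairs):
    """Cosine distance for each (a, b) row pair, computing every row norm only once."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)  # one norm per row, computed up front
    # Per pair, only the dot product remains to be computed
    return [1.0 - np.dot(X[a], X[b]) / (norms[a] * norms[b]) for a, b in pairs]


rng = np.random.default_rng(0)
X = rng.random((50, 8))  # small stand-in for artist_topic_probs
pairs = [(0, 1), (2, 3), (4, 49)]
print(cosine_distances_for_pairs(X, pairs))
```

This still loops in Python, so it only removes the repeated norm work; the loop itself is a separate cost.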

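Also, you are using an explicit loop. If you can express the logic with vectorized numpy operators, you can trim another big slice off the computation time.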
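One way the vectorized version could look is sketched below, under the assumption that no row is all zeros (such a row would produce NaN). The function name and the batch_size parameter are hypothetical; batching is only there to bound the size of the temporary arrays when there are millions of pairs.

```python
import numpy as np


def cosine_distances_vectorized(X, pairs, batch_size=100_000):
    """Cosine distances for many (a, b) row pairs without a per-pair Python loop."""
    X = np.asarray(X, dtype=float)
    pairs = np.asarray(pairs, dtype=np.intp).reshape(-1, 2)
    # Normalize each row once; the cosine similarity of unit vectors is their dot product
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    out = np.empty(len(pairs))
    for start in range(0, len(pairs), batch_size):
        a = pairs[start:start + batch_size, 0]
        b = pairs[start:start + batch_size, 1]
        # Row-wise dot products for the whole batch in one call
        out[start:start + batch_size] = 1.0 - np.einsum("ij,ij->i", unit[a], unit[b])
    return out


X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(cosine_distances_vectorized(X, [(0, 1), (0, 2), (1, 2)], batch_size=2))
```

The Python loop now runs once per batch rather than once per pair, so most of the time is spent inside NumPy.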

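Finally, you talk about cosine similarity... Keep in mind that scipy.spatial.distance.cosine computes the cosine distance instead. The relationship is simple, cs = 1 - cd, but I don't see this in the code you posted.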
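For reference, a tiny sketch of converting the distance returned by scipy.spatial.distance.cosine back into a similarity. The two vectors are made-up illustration data.

```python
import numpy as np
from scipy.spatial.distance import cosine

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

cd = cosine(u, v)  # cosine distance
cs = 1.0 - cd      # cosine similarity

# Direct computation of the similarity, for comparison
direct = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cs, direct)
```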

Parallelizing array row similarity computation in Python
