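How to quickly fill a 100000x100000 matrix in Python using NumPy?

Time: 2019-04-07 08:40:36

Tags: python python-3.x algorithm matrix data-structures

I'm really into data structures and algorithms.

I am inserting data into an 80000 x 80000 matrix using numpy. My code looks like this: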

n = 80000
similarity = np.zeros((n, n), dtype='int8')
for i, photo_i in enumerate(photos):
    for j, photo_j in enumerate(photos[i:]):
        similarity[i, j] = score(photo_i, photo_j)
    if i % 100 == 0:
        print(i)

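This code takes too much time. The score function is O(1). I'd like to know whether there is a better way to do this. I want to populate this matrix in a "short time", but the way I am doing it has O(n^2) complexity.

Is there "anything" that could be "optimized", or a different data structure I could use?

I have read similar questions on SO, and they mention pytables. I will definitely try it, but I don't know how yet. Any suggestions are welcome.

Thanks.

1 Answer:

Answer 0: (score: 1)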

There's a bunch of different things you could do, which all revolve around avoiding the explicit for-loops, which are slow in Python, and delegating to C-level code (either using Python's underlying C runtime or numpy's builtin array creation methods).
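As an illustration of what delegating to C-level code can look like, here is a minimal sketch. It assumes something the question does not state: that each photo is a set of integer tag ids and that score counts shared tags. Under that assumption the whole matrix comes from one matrix product; the names photos, num_tags and tags are hypothetical.

```python
import numpy as np

# Hypothetical data: each photo is a set of integer tag ids (an assumption).
photos = [{0, 1, 2}, {1, 2}, {3}, {0, 3, 4}]
num_tags = 5

def score(photo_i, photo_j):
    # Stand-in for the question's score(): number of shared tags.
    return len(photo_i & photo_j)

# Encode every photo as a 0/1 row vector over the tag vocabulary.
tags = np.zeros((len(photos), num_tags), dtype=np.int32)
for row, photo in enumerate(photos):
    tags[row, list(photo)] = 1

# One matrix product computes every pairwise shared-tag count in C.
similarity = tags @ tags.T

# Reference result from explicit Python loops, for comparison.
expected = np.array([[score(a, b) for b in photos] for a in photos])
```

Whether your real score function can be rewritten this way depends entirely on what it computes; if it can't, the approaches below still apply.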

Using fromfunction

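Numpy has a built-in function for populating a matrix from a function taking coordinates: numpy.fromfunction. Be aware that it calls the function once with whole coordinate arrays rather than once per cell, so it is only fast if score itself works on arrays. If score only accepts one pair of photos, you can wrap the function with numpy.vectorize, but that is a convenience wrapper around a Python-level loop, so don't expect a big speedup from it. fromfunction also materializes two full n x n coordinate arrays, which is a lot of memory for n = 80000.

You'd have to supply it a score-by-coordinates function, e.g.:

def similarity_value(i, j, photos=photos):
  return score(photos[i], photos[j])

# dtype is the dtype of the coordinate arrays, so it must be an integer type
# wide enough for n; otypes sets the dtype of the result
similarity = numpy.fromfunction(numpy.vectorize(similarity_value, otypes=[numpy.int8]),
                                (n, n), dtype=int)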

The photos=photos in the function definition makes the photos array a local of the function and saves some time accessing it on each invocation; this is a common Python micro-optimization technique.
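To see how numpy.fromfunction actually invokes its callable, here is a small self-contained sketch. The functions record and scalar_only are made up for the demonstration; they are not part of the question.

```python
import numpy as np

calls = []

def record(i, j):
    # Remember what fromfunction passed in, then return an array result.
    calls.append((i, j))
    return i * 10 + j

# dtype is the dtype of the coordinate arrays handed to the function.
grid = np.fromfunction(record, (3, 4), dtype=int)

def scalar_only(i, j):
    # Works on single integers only, because of the scalar comparison.
    return i * 10 + j if i < j else 0

# np.vectorize makes a scalar-only function usable with fromfunction,
# although it still loops over the cells at Python speed.
upper = np.fromfunction(np.vectorize(scalar_only, otypes=[int]), (3, 4), dtype=int)
```

After running this, calls holds a single entry whose elements are 3 x 4 index arrays, which shows that the callable runs once on arrays rather than once per cell.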

Note that this computes the similarity for the entire matrix instead of just a triangle. To fix this, you could do:

def similarity_value(i, j, photos=photos):
  return score(photos[i], photos[j]) if i < j else 0

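# numpy.vectorize calls similarity_value once per (i, j) pair, which is what
# makes the scalar `if i < j` test valid; dtype=int is the coordinate dtype
similarity = numpy.fromfunction(numpy.vectorize(similarity_value, otypes=[numpy.int8]),
                                (n, n), dtype=int)
similarity += similarity.T  # fill in other triangle from transposed matrix (diagonal stays 0)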

Using comprehensions

You could also try creating the similarity matrix from a generator expression (or even a list comprehension), again avoiding the explicit for-loops in favor of a comprehension, which is usually somewhat faster, but sacrificing the triangle optimization:

# fromiter builds a 1-D array and has no shape argument, so reshape afterwards
similarity = numpy.fromiter((score(photo_i, photo_j)
                             for photo_i in photos
                             for photo_j in photos),
                            dtype='int8', count=n * n).reshape(n, n)

# or:
similarity = numpy.array([score(photo_i, photo_j)
                          for photo_i in photos
                          for photo_j in photos],
                         dtype='int8').reshape(n, n)

To re-introduce the triangle optimization, you could do something like:

similarity = numpy.array([score(photo_i, photo_j) if i < j else 0
                          for i, photo_i in enumerate(photos)
                          for j, photo_j in enumerate(photos)],
                         dtype='int8').reshape(n, n)
similarity += similarity.T  # the diagonal stays 0

Using triu_indices to populate a triangle directly

Finally, you could use numpy.triu_indices to assign directly into the matrix's upper (and then lower) triangle:

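# Materialize the values as an array: a bare generator can't be assigned via
# fancy indexing, and it would be exhausted after the first use anyway.
similarity_values = np.fromiter((score(photo_i, photo_j)
                                 for i, photo_i in enumerate(photos)
                                 for photo_j in photos[i + 1:]),  # upper triangle only, row by row
                                dtype='int8', count=n * (n - 1) // 2)
similarity = np.zeros((n, n), dtype='int8')
xs, ys = np.triu_indices(n, 1)  # same row-by-row order as the generator above
similarity[xs, ys] = similarity_values
similarity[ys, xs] = similarity_values
similarity[np.diag_indices(n)] = 1  # assuming score(x, x) == 1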
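A quick way to convince yourself that the triangle values land in the right cells is to try the idea on toy data. The sketch below uses a made-up score (shared elements between small sets) and itertools.combinations, which yields pairs in the same row-by-row order as np.triu_indices(n, 1); both are assumptions for illustration, not the asker's real data.

```python
import numpy as np
from itertools import combinations

# Toy stand-ins for the question's data (assumptions for illustration).
photos = [{0, 1}, {1, 2}, {2, 3}, {0, 3}, {1}]
n = len(photos)

def score(photo_i, photo_j):
    return len(photo_i & photo_j)

# combinations() yields pairs (i, j) with i < j, row by row,
# matching the order of np.triu_indices(n, 1).
values = np.fromiter((score(a, b) for a, b in combinations(photos, 2)),
                     dtype='int8', count=n * (n - 1) // 2)

similarity = np.zeros((n, n), dtype='int8')
rows, cols = np.triu_indices(n, 1)
similarity[rows, cols] = values   # upper triangle
similarity[cols, rows] = values   # mirrored lower triangle
similarity[np.diag_indices(n)] = [score(p, p) for p in photos]

# Reference result from the plain double loop.
expected = np.array([[score(a, b) for b in photos] for a in photos], dtype='int8')
```

Comparing similarity with expected on a small n like this is cheap, and it catches ordering mistakes before you run the full 80000 x 80000 job.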

This approach is inspired by this related question: https://codereview.stackexchange.com/questions/107094/create-symmetrical-matrix-from-list-of-values

I don't have a means of benchmarking which of these approaches would work best, but you could experiment and find out. Good luck!
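If you want to run that experiment, a minimal timing harness could look like the following. The workload is a toy (n = 200 and a made-up set-based score), so the numbers only hint at relative cost; swap in your own photos and score before drawing conclusions.

```python
import timeit
from itertools import combinations
import numpy as np

# Toy workload (an assumption): small n so each run finishes quickly.
n = 200
photos = [frozenset({k % 7, k % 11, k % 13}) for k in range(n)]

def score(photo_i, photo_j):
    return len(photo_i & photo_j)

def nested_loops():
    # Baseline: explicit Python loops over the upper triangle.
    out = np.zeros((n, n), dtype='int8')
    for i, photo_i in enumerate(photos):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = score(photo_i, photos[j])
    return out

def triangle_fromiter():
    # Build the triangle values in one pass, then assign them in bulk.
    values = np.fromiter((score(a, b) for a, b in combinations(photos, 2)),
                         dtype='int8', count=n * (n - 1) // 2)
    out = np.zeros((n, n), dtype='int8')
    rows, cols = np.triu_indices(n, 1)
    out[rows, cols] = values
    out[cols, rows] = values
    return out

# Best of three single runs per approach.
timings = {
    func.__name__: min(timeit.repeat(func, number=1, repeat=3))
    for func in (nested_loops, triangle_fromiter)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Both functions leave the diagonal at 0, so their outputs can be compared directly to check that the faster variant is also correct.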