Question

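In Python/NumPy, I have a 10,000x10,000 array named random_matrix. I use md5 to compute the hash of str(random_matrix) and of random_matrix itself. The string version takes 0.00754404067993 seconds, and the numpy array version takes 1.6968960762 seconds. When I go up to a 20,000x20,000 array, the string version takes 0.0778470039368 seconds, and the numpy array version takes 60.641119957 seconds. Why is this? Do numpy arrays take up a lot more memory than strings? Also, if I want to name files identified by these matrices, is converting to a string before computing the hash a good idea, or are there drawbacks?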

Answer 1

str(random_matrix) will not include the whole matrix, because numpy abbreviates large arrays and replaces the omitted elements with "...":

>>> import numpy as np
>>> x = np.ones((1000, 1000))
>>> print(str(x))
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ..., 
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

So when you hash str(random_matrix), you are not really hashing all of the data.
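To make the size difference concrete, here is a minimal sketch (not from the original answer) that compares the length of the abbreviated string with the number of bytes the array actually holds. It assumes numpy's default print options, under which arrays with more than 1000 elements are summarized.

```python
import numpy as np

x = np.ones((1000, 1000))

# With default print options, str(x) only shows the first and last few
# rows and columns, so the resulting string is very short.
text = str(x)
print(len(text))

# The array itself holds 1000 * 1000 float64 values, 8 bytes each.
print(x.nbytes)  # 8000000
```

md5 over the string therefore only has to process a tiny input, while md5 over the array has to process every byte of it, which would explain most of the timing gap.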

See this previous question and this one for how to hash numpy arrays.
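As one possible approach (a sketch, not necessarily what the linked questions recommend), you can feed the array's raw bytes to hashlib. The helper name array_md5 is hypothetical, and mixing the shape and dtype into the hash is an assumption made here so that arrays with the same bytes but a different layout do not collide.

```python
import hashlib

import numpy as np


def array_md5(arr):
    """Hypothetical helper: MD5 over the shape, dtype, and raw bytes of arr."""
    h = hashlib.md5()
    h.update(str(arr.shape).encode("utf-8"))
    h.update(str(arr.dtype).encode("utf-8"))
    h.update(arr.tobytes())  # every element, in C order
    return h.hexdigest()


a = np.ones((1000, 1000))
b = a.copy()
b[500, 500] = 2.0  # change one element in the part that str() omits

# The abbreviated strings are identical, so hashing them cannot tell
# the two arrays apart; hashing the full data can.
same_str = str(a) == str(b)
same_hash = array_md5(a) == array_md5(b)
print(same_str, same_hash)
```

This also shows the drawback of using str(random_matrix) for file names: two different matrices that agree on the displayed corners would get the same hash.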

Why is md5 hashing so much faster on strings than on numpy arrays in python?
