I have a set of images, and I want to hash their data into an ID.
Currently I do it like this:
import hashlib
import uuid

def get_image_uuid(pil_img):
    # Read PIL image data
    img_bytes_ = pil_img.tobytes()
    # hash the bytes using sha1
    bytes_sha1 = hashlib.sha1(img_bytes_)
    hashbytes_20 = bytes_sha1.digest()
    # sha1 produces 20 bytes, but UUID requires 16 bytes
    hashbytes_16 = hashbytes_20[0:16]
    uuid_ = uuid.UUID(bytes=hashbytes_16)
    return uuid_
This reads all of the pixel data in the image, which is overkill for a deterministic 16-byte UUID hash.
Is there a way to do something like this?
img_bytes = pil_img.tobytes(stride=16)
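For reference, the effect such a stride argument would have on the hash can be emulated by slicing the bytes after the fact. The sketch below is only an illustration: `uuid_from_bytes` and `fake_pixels` are hypothetical names, the input is synthetic rather than real pixel data, and slicing after `tobytes()` does not avoid decoding the whole image.

```python
import hashlib
import uuid


def uuid_from_bytes(img_bytes, stride=1):
    """Hash every `stride`-th byte with SHA-1 and build a UUID from it."""
    # Slicing a bytes object copies the selected bytes, so the result is contiguous.
    sampled = img_bytes[::stride]
    digest = hashlib.sha1(sampled).digest()  # 20 bytes
    return uuid.UUID(bytes=digest[:16])  # UUID needs exactly 16 bytes


# Stand-in for pil_img.tobytes(); real pixel data would go here.
fake_pixels = bytes(range(256)) * 64
full_id = uuid_from_bytes(fake_pixels)
strided_id = uuid_from_bytes(fake_pixels, stride=16)
```

The strided ID only depends on the sampled bytes, so two images that differ solely in unsampled bytes would get the same ID.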
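Edit: I generated some detailed timing results using the script below. I should mention that the images I'm using are large (around 6MB). I tested on both Windows and Linux: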
# Python 2 script (uses __builtin__, xrange and func_name)
from __future__ import absolute_import, division, print_function
import __builtin__
import time
import timeit
from PIL import Image
import hashlib
import numpy as np
import uuid
# My data getters
from vtool.tests import grabdata
elephant = grabdata.get_testimg_path('elephant.jpg')
lena = grabdata.get_testimg_path('lena.jpg')
zebra = grabdata.get_testimg_path('zebra.jpg')
jeff = grabdata.get_testimg_path('jeff.png')
gpath = elephant

# Fall back to a no-op decorator when not running under the line profiler
try:
    getattr(__builtin__, 'profile')
    __LINE_PROFILE__ = True
except AttributeError:
    __LINE_PROFILE__ = False
    def profile(func):
        return func

@profile
def get_image_uuid(img_bytes_):
    # hash the bytes using sha1
    bytes_sha1 = hashlib.sha1(img_bytes_)
    hashbytes_20 = bytes_sha1.digest()
    # sha1 produces 20 bytes, but UUID requires 16 bytes
    hashbytes_16 = hashbytes_20[0:16]
    uuid_ = uuid.UUID(bytes=hashbytes_16)
    return uuid_

@profile
def make_uuid_PIL_bytes(gpath):
    pil_img = Image.open(gpath, 'r')
    # Read PIL image data
    img_bytes_ = pil_img.tobytes()
    uuid_ = get_image_uuid(img_bytes_)
    return uuid_

@profile
def make_uuid_NUMPY_bytes(gpath):
    pil_img = Image.open(gpath, 'r')
    # Read PIL image data
    np_img = np.asarray(pil_img)
    np_flat = np_img.ravel()
    img_bytes_ = np_flat.tostring()
    uuid_ = get_image_uuid(img_bytes_)
    return uuid_

@profile
def make_uuid_NUMPY_STRIDE_16_bytes(gpath):
    pil_img = Image.open(gpath, 'r')
    # Read PIL image data
    np_img = np.asarray(pil_img)
    np_flat = np_img.ravel()[::16]
    img_bytes_ = np_flat.tostring()
    uuid_ = get_image_uuid(img_bytes_)
    return uuid_

@profile
def make_uuid_NUMPY_STRIDE_64_bytes(gpath):
    pil_img = Image.open(gpath, 'r')
    # Read PIL image data
    img_bytes_ = np.asarray(pil_img).ravel()[::64].tostring()
    uuid_ = get_image_uuid(img_bytes_)
    return uuid_

@profile
def make_uuid_CONTIG_NUMPY_bytes(gpath):
    pil_img = Image.open(gpath, 'r')
    # Read PIL image data
    np_img = np.asarray(pil_img)
    # As profiled: np_flat is already a byte string at this point
    np_flat = np_img.ravel().tostring()
    np_contig = np.ascontiguousarray(np_flat)
    img_bytes_ = np_contig.tostring()
    uuid_ = get_image_uuid(img_bytes_)
    return uuid_

@profile
def make_uuid_CONTIG_NUMPY_STRIDE_16_bytes(gpath):
    pil_img = Image.open(gpath, 'r')
    # Read PIL image data
    np_img = np.asarray(pil_img)
    np_contig = np.ascontiguousarray(np_img.ravel()[::16])
    img_bytes_ = np_contig.tostring()
    uuid_ = get_image_uuid(img_bytes_)
    return uuid_

@profile
def make_uuid_CONTIG_NUMPY_STRIDE_64_bytes(gpath):
    pil_img = Image.open(gpath, 'r')
    # Read PIL image data
    img_bytes_ = np.ascontiguousarray(np.asarray(pil_img).ravel()[::64]).tostring()
    uuid_ = get_image_uuid(img_bytes_)
    return uuid_

if __name__ == '__main__':
    # cool trick
    test_funcs = [
        make_uuid_PIL_bytes,
        make_uuid_NUMPY_bytes,
        make_uuid_NUMPY_STRIDE_16_bytes,
        make_uuid_NUMPY_STRIDE_64_bytes,
        make_uuid_CONTIG_NUMPY_bytes,
        make_uuid_CONTIG_NUMPY_STRIDE_16_bytes,
        make_uuid_CONTIG_NUMPY_STRIDE_64_bytes,
    ]
    func_strs = ', '.join([func.func_name for func in test_funcs])
    setup = 'from __main__ import (gpath, %s) ' % (func_strs,)
    number = 2
    for func in test_funcs:
        func_name = func.func_name
        print('Running: %s' % func_name)
        if __LINE_PROFILE__:
            # Under the line profiler, time with a plain loop
            start = time.time()
            for _ in xrange(number):
                func(gpath)
            total_time = time.time() - start
        else:
            stmt = '%s(gpath)' % func_name
            total_time = timeit.timeit(stmt=stmt, setup=setup, number=number)
        print('timed: %r seconds in %s' % (total_time, func_name))
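The script above targets Python 2 (`__builtin__`, `xrange`, `func_name`). A minimal Python 3 sketch of the same timing loop might look like the following. It is not the script that produced the results in this post: `get_uuid`, `hash_all`, `hash_stride_16` and `time_funcs` are hypothetical names, and it times hashing of synthetic bytes rather than loading real images.

```python
import hashlib
import timeit
import uuid


def get_uuid(data):
    """SHA-1 the bytes and keep the first 16 of the 20 digest bytes."""
    return uuid.UUID(bytes=hashlib.sha1(data).digest()[:16])


def hash_all(data):
    return get_uuid(data)


def hash_stride_16(data):
    return get_uuid(data[::16])


def time_funcs(funcs, data, number=2):
    """Return {function name: total seconds for `number` calls}."""
    results = {}
    for func in funcs:
        # Passing a callable avoids the 'from __main__ import ...' setup string.
        results[func.__name__] = timeit.timeit(lambda: func(data), number=number)
    return results


if __name__ == '__main__':
    data = bytes(range(256)) * 4096  # 1 MiB of synthetic bytes, not a real image
    for name, seconds in time_funcs([hash_all, hash_stride_16], data).items():
        print('timed: %r seconds in %s' % (seconds, name))
```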
Here are the Windows line-profiler results:
File: _timeits/time_uuids.py
Function: make_uuid_CONTIG_NUMPY_STRIDE_16_bytes at line 91
Total time: 1.03287 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
91 @profile
92 def make_uuid_CONTIG_NUMPY_STRIDE_16_bytes(gpath):
93 2 3571 1785.5 0.1 pil_img = Image.open(gpath, 'r')
94 # Read PIL image data
95 2 3310103 1655051.5 96.2 np_img = np.asarray(pil_img)
96 2 44833 22416.5 1.3 np_contig = np.ascontiguousarray(np_img.ravel()[::16])
97 2 9657 4828.5 0.3 img_bytes_ = np_contig.tostring()
98 2 72560 36280.0 2.1 uuid_ = get_image_uuid(img_bytes_)
99 2 4 2.0 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_CONTIG_NUMPY_STRIDE_64_bytes at line 102
Total time: 1.0385 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
102 @profile
103 def make_uuid_CONTIG_NUMPY_STRIDE_64_bytes(gpath):
104 2 3285 1642.5 0.1 pil_img = Image.open(gpath, 'r')
105 # Read PIL image data
106 2 3436641 1718320.5 99.3 img_bytes_ = np.ascontiguousarray(np.asarray(pil_img).ravel()[::64]).tostring()
107 2 19570 9785.0 0.6 uuid_ = get_image_uuid(img_bytes_)
108 2 4 2.0 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_NUMPY_STRIDE_64_bytes at line 70
Total time: 1.04175 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
70 @profile
71 def make_uuid_NUMPY_STRIDE_64_bytes(gpath):
72 2 3356 1678.0 0.1 pil_img = Image.open(gpath, 'r')
73 # Read PIL image data
74 2 3447197 1723598.5 99.3 img_bytes_ = np.asarray(pil_img).ravel()[::64].tostring()
75 2 19774 9887.0 0.6 uuid_ = get_image_uuid(img_bytes_)
76 2 4 2.0 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_NUMPY_STRIDE_16_bytes at line 59
Total time: 1.0913 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
59 @profile
60 def make_uuid_NUMPY_STRIDE_16_bytes(gpath):
61 2 3706 1853.0 0.1 pil_img = Image.open(gpath, 'r')
62 # Read PIL image data
63 2 3339663 1669831.5 91.9 np_img = np.asarray(pil_img)
64 2 112 56.0 0.0 np_flat = np_img.ravel()[::16]
65 2 217844 108922.0 6.0 img_bytes_ = np_flat.tostring()
66 2 74044 37022.0 2.0 uuid_ = get_image_uuid(img_bytes_)
67 2 4 2.0 0.0 return uuid_
File: _timeits/time_uuids.py
Function: get_image_uuid at line 28
Total time: 1.10141 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
28 @profile
29 def get_image_uuid(img_bytes_):
30 # hash the bytes using sha1
31 14 3665965 261854.6 99.9 bytes_sha1 = hashlib.sha1(img_bytes_)
32 14 326 23.3 0.0 hashbytes_20 = bytes_sha1.digest()
33 # sha1 produces 20 bytes, but UUID requires 16 bytes
34 14 75 5.4 0.0 hashbytes_16 = hashbytes_20[0:16]
35 14 2661 190.1 0.1 uuid_ = uuid.UUID(bytes=hashbytes_16)
36 14 40 2.9 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_PIL_bytes at line 39
Total time: 1.33926 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
39 @profile
40 def make_uuid_PIL_bytes(gpath):
41 2 25940 12970.0 0.6 pil_img = Image.open(gpath, 'r')
42 # Read PIL image data
43 2 3277455 1638727.5 73.5 img_bytes_ = pil_img.tobytes()
44 2 1158009 579004.5 26.0 uuid_ = get_image_uuid(img_bytes_)
45 2 4 2.0 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_NUMPY_bytes at line 48
Total time: 1.39694 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
48 @profile
49 def make_uuid_NUMPY_bytes(gpath):
50 2 3406 1703.0 0.1 pil_img = Image.open(gpath, 'r')
51 # Read PIL image data
52 2 3344608 1672304.0 71.9 np_img = np.asarray(pil_img)
53 2 46 23.0 0.0 np_flat = np_img.ravel()
54 2 133593 66796.5 2.9 img_bytes_ = np_flat.tostring()
55 2 1171888 585944.0 25.2 uuid_ = get_image_uuid(img_bytes_)
56 2 5 2.5 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_CONTIG_NUMPY_bytes at line 79
Total time: 1.4899 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
79 @profile
80 def make_uuid_CONTIG_NUMPY_bytes(gpath):
81 2 3384 1692.0 0.1 pil_img = Image.open(gpath, 'r')
82 # Read PIL image data
83 2 3376051 1688025.5 68.0 np_img = np.asarray(pil_img)
84 2 133156 66578.0 2.7 np_flat = np_img.ravel().tostring()
85 2 146959 73479.5 3.0 np_contig = np.ascontiguousarray(np_flat)
86 2 149330 74665.0 3.0 img_bytes_ = np_contig.tostring()
87 2 1154328 577164.0 23.3 uuid_ = get_image_uuid(img_bytes_)
88 2 4 2.0 0.0 return uuid_
Here are the Linux line-profiler results:
File: _timeits/time_uuids.py
Function: make_uuid_NUMPY_STRIDE_64_bytes at line 70
Total time: 0.456272 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
70 @profile
71 def make_uuid_NUMPY_STRIDE_64_bytes(gpath):
72 2 449 224.5 0.1 pil_img = Image.open(gpath, 'r')
73 # Read PIL image data
74 2 452880 226440.0 99.3 img_bytes_ = np.asarray(pil_img).ravel()[::64].tostring()
75 2 2942 1471.0 0.6 uuid_ = get_image_uuid(img_bytes_)
76 2 1 0.5 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_CONTIG_NUMPY_STRIDE_64_bytes at line 102
Total time: 0.457588 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
102 @profile
103 def make_uuid_CONTIG_NUMPY_STRIDE_64_bytes(gpath):
104 2 445 222.5 0.1 pil_img = Image.open(gpath, 'r')
105 # Read PIL image data
106 2 454269 227134.5 99.3 img_bytes_ = np.ascontiguousarray(np.asarray(pil_img).ravel()[::64]).tostring()
107 2 2872 1436.0 0.6 uuid_ = get_image_uuid(img_bytes_)
108 2 2 1.0 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_CONTIG_NUMPY_STRIDE_16_bytes at line 91
Total time: 0.461928 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
91 @profile
92 def make_uuid_CONTIG_NUMPY_STRIDE_16_bytes(gpath):
93 2 482 241.0 0.1 pil_img = Image.open(gpath, 'r')
94 # Read PIL image data
95 2 436622 218311.0 94.5 np_img = np.asarray(pil_img)
96 2 10990 5495.0 2.4 np_contig = np.ascontiguousarray(np_img.ravel()[::16])
97 2 2931 1465.5 0.6 img_bytes_ = np_contig.tostring()
98 2 10902 5451.0 2.4 uuid_ = get_image_uuid(img_bytes_)
99 2 1 0.5 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_NUMPY_STRIDE_16_bytes at line 59
Total time: 0.492819 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
59 @profile
60 def make_uuid_NUMPY_STRIDE_16_bytes(gpath):
61 2 481 240.5 0.1 pil_img = Image.open(gpath, 'r')
62 # Read PIL image data
63 2 441343 220671.5 89.6 np_img = np.asarray(pil_img)
64 2 34 17.0 0.0 np_flat = np_img.ravel()[::16]
65 2 39996 19998.0 8.1 img_bytes_ = np_flat.tostring()
66 2 10964 5482.0 2.2 uuid_ = get_image_uuid(img_bytes_)
67 2 1 0.5 0.0 return uuid_
File: _timeits/time_uuids.py
Function: get_image_uuid at line 28
Total time: 0.545926 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
28 @profile
29 def get_image_uuid(img_bytes_):
30 # hash the bytes using sha1
31 14 545037 38931.2 99.8 bytes_sha1 = hashlib.sha1(img_bytes_)
32 14 115 8.2 0.0 hashbytes_20 = bytes_sha1.digest()
33 # sha1 produces 20 bytes, but UUID requires 16 bytes
34 14 24 1.7 0.0 hashbytes_16 = hashbytes_20[0:16]
35 14 742 53.0 0.1 uuid_ = uuid.UUID(bytes=hashbytes_16)
36 14 8 0.6 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_PIL_bytes at line 39
Total time: 0.625736 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
39 @profile
40 def make_uuid_PIL_bytes(gpath):
41 2 3915 1957.5 0.6 pil_img = Image.open(gpath, 'r')
42 # Read PIL image data
43 2 449092 224546.0 71.8 img_bytes_ = pil_img.tobytes()
44 2 172728 86364.0 27.6 uuid_ = get_image_uuid(img_bytes_)
45 2 1 0.5 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_NUMPY_bytes at line 48
Total time: 0.663057 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
48 @profile
49 def make_uuid_NUMPY_bytes(gpath):
50 2 468 234.0 0.1 pil_img = Image.open(gpath, 'r')
51 # Read PIL image data
52 2 437346 218673.0 66.0 np_img = np.asarray(pil_img)
53 2 18 9.0 0.0 np_flat = np_img.ravel()
54 2 51512 25756.0 7.8 img_bytes_ = np_flat.tostring()
55 2 173712 86856.0 26.2 uuid_ = get_image_uuid(img_bytes_)
56 2 1 0.5 0.0 return uuid_
File: _timeits/time_uuids.py
Function: make_uuid_CONTIG_NUMPY_bytes at line 79
Total time: 0.756671 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
79 @profile
80 def make_uuid_CONTIG_NUMPY_bytes(gpath):
81 2 483 241.5 0.1 pil_img = Image.open(gpath, 'r')
82 # Read PIL image data
83 2 437192 218596.0 57.8 np_img = np.asarray(pil_img)
84 2 48152 24076.0 6.4 np_flat = np_img.ravel().tostring()
85 2 49502 24751.0 6.5 np_contig = np.ascontiguousarray(np_flat)
86 2 49269 24634.5 6.5 img_bytes_ = np_contig.tostring()
87 2 172072 86036.0 22.7 uuid_ = get_image_uuid(img_bytes_)
88 2 1 0.5 0.0 return uuid_
Here are the Windows timeit results:
Running: make_uuid_PIL_bytes
timed: 1.4041314945785952 seconds in make_uuid_PIL_bytes
Running: make_uuid_NUMPY_bytes
timed: 1.4475939890251077 seconds in make_uuid_NUMPY_bytes
Running: make_uuid_NUMPY_STRIDE_16_bytes
timed: 1.136886564762671 seconds in make_uuid_NUMPY_STRIDE_16_bytes
Running: make_uuid_NUMPY_STRIDE_64_bytes
timed: 1.0767879228155284 seconds in make_uuid_NUMPY_STRIDE_64_bytes
Running: make_uuid_CONTIG_NUMPY_bytes
timed: 1.5433727380795146 seconds in make_uuid_CONTIG_NUMPY_bytes
Running: make_uuid_CONTIG_NUMPY_STRIDE_16_bytes
timed: 1.0804961515831941 seconds in make_uuid_CONTIG_NUMPY_STRIDE_16_bytes
Running: make_uuid_CONTIG_NUMPY_STRIDE_64_bytes
timed: 1.0577325560451953 seconds in make_uuid_CONTIG_NUMPY_STRIDE_64_bytes
Here are the Linux timeit results:
Running: make_uuid_PIL_bytes
timed: 0.6316661834716797 seconds in make_uuid_PIL_bytes
Running: make_uuid_NUMPY_bytes
timed: 0.666496992111206 seconds in make_uuid_NUMPY_bytes
Running: make_uuid_NUMPY_STRIDE_16_bytes
timed: 0.4908161163330078 seconds in make_uuid_NUMPY_STRIDE_16_bytes
Running: make_uuid_NUMPY_STRIDE_64_bytes
timed: 0.4494049549102783 seconds in make_uuid_NUMPY_STRIDE_64_bytes
Running: make_uuid_CONTIG_NUMPY_bytes
timed: 0.7838680744171143 seconds in make_uuid_CONTIG_NUMPY_bytes
Running: make_uuid_CONTIG_NUMPY_STRIDE_16_bytes
timed: 0.462860107421875 seconds in make_uuid_CONTIG_NUMPY_STRIDE_16_bytes
Running: make_uuid_CONTIG_NUMPY_STRIDE_64_bytes
timed: 0.45322108268737793 seconds in make_uuid_CONTIG_NUMPY_STRIDE_64_bytes
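So it looks like loading the image is the main culprit (because these images are so large), but striding does help the hashing by a small (but significant) amount.
It would still be very nice to be able to load only a subset of the data. Does anyone know a way to do this?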
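One option that sidesteps decoding entirely, if it is acceptable for the ID to identify the file rather than the decoded pixels, is to hash the raw file bytes. This is a different technique from the pixel hashing above, not a faster version of it: the same pixels saved with different compression settings or metadata would get different IDs. `uuid_from_file` and `chunk_size` are hypothetical names used only for this sketch.

```python
import hashlib
import uuid


def uuid_from_file(path, chunk_size=1 << 20):
    """Build a UUID from the SHA-1 of the file's raw (still encoded) bytes."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        # Read in chunks so a large file is never held in memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            sha1.update(chunk)
    # sha1 produces 20 bytes, but UUID requires 16 bytes
    return uuid.UUID(bytes=sha1.digest()[:16])
```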
Answer 0: (score: 2)
(I'm using Pillow 5.1.0 on Python 3.6.4, on macOS 10.13.3)
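I recently ran into a similar problem with images larger than 250MB (!). My use case is slightly different, since I need the actual RGB values rather than the bytes, but I found that cropping the image first and then running getdata() on the cropped region is much faster for "random access" to a slice of the image. Specifically, on a 30MB image, img.crop(<left, upper, right, lower>).getdata() is about 28,000 times faster than img.getdata()[<slice>].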
>>> t0 = time.time(); x = list(img.getdata())[3336*500:3336*500+3]; t1 = time.time(); print(x, t1-t0)
[(92, 102, 136), (110, 153, 220), (114, 184, 232)] 1.6889581680297852
>>> t0 = time.time(); y = list(img.crop((0, 500, 3, 501)).getdata()); t1 = time.time(); print(y, t1-t0)
[(92, 102, 136), (110, 153, 220), (114, 184, 232)] 5.91278076171875e-05
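(1.6 seconds vs 0.000059 seconds)
Again, this gets the RGB values rather than the image byte data, but depending on your needs that may be acceptable. It also has the side benefit of not requiring numpy, which is a plus for me.
Of course, the logic gets more involved depending on how much data you need and where it starts, since a range may wrap onto the next row. That would be ugly and possibly not worth the cost in maintainability and readability.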
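As a rough illustration of that row-wrapping logic, the sketch below splits a flat pixel range into one crop box per row. `flat_range_to_boxes` and `read_flat_range` are hypothetical helpers, and `img` is assumed to be a PIL image whose `size` is `(width, height)` and whose `crop()` takes a `(left, upper, right, lower)` box.

```python
def flat_range_to_boxes(start, stop, width):
    """Split the flat pixel range [start, stop) into per-row crop boxes.

    Each box is (left, upper, right, lower), the form Image.crop() expects.
    """
    boxes = []
    pos = start
    while pos < stop:
        row, col = divmod(pos, width)
        # Stop at the end of this row or at the end of the range.
        run = min(width - col, stop - pos)
        boxes.append((col, row, col + run, row + 1))
        pos += run
    return boxes


def read_flat_range(img, start, stop):
    """Return pixels [start, stop) in row-major order, one crop per row."""
    width = img.size[0]
    pixels = []
    for box in flat_range_to_boxes(start, stop, width):
        pixels.extend(img.crop(box).getdata())
    return pixels
```

For the example above (three pixels starting at row 500 of a 3336-pixel-wide image), `flat_range_to_boxes(3336*500, 3336*500+3, 3336)` yields the single box `(0, 500, 3, 501)`.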
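Answer 1: (score: 1)
You can convert the image to a numpy.array and then use slice notation. You will probably want to flatten the image to a one-dimensional array first, which you can do with array.ravel.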
>>> import numpy as np
>>> pixels = np.asarray(pil_img)
>>> pixels.shape
(2592, 1936, 3)
>>> subset = pixels.ravel()[::16] #every 16th byte of pixels.
>>> subset.shape
(940896,)
Note that the resulting size of the array equals (2592 * 1936 * 3) / 16.
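A quick check of that arithmetic, using only the shape printed above. The general rule assumed here is that a step-k slice over n items keeps ceil(n / k) of them; the variable names are just for illustration.

```python
import math

height, width, channels = 2592, 1936, 3
n_values = height * width * channels

# A slice with step k over n items keeps ceil(n / k) of them.
expected = math.ceil(n_values / 16)

# range() supports slicing lazily, so this does not allocate 15 million items.
actual = len(range(n_values)[::16])
```

Both `expected` and `actual` come out to 940896, matching `subset.shape`.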
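Edit:
Your comment made me curious, so I went ahead and timed it myself. It turns out hashlib.sha1 has some additional requirements for the arrays it processes: they must be contiguous and in 'C-order' (don't worry if that doesn't mean anything to you).
So I ended up having to do the following: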
pixels = np.ascontiguousarray(np.asarray(img).ravel()[::16])
hashlib.sha1(pixels)
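The contiguity requirement is not specific to NumPy. The sketch below reproduces it with a plain `memoryview` from the standard library, used here only as a stand-in for a strided array; calling `tobytes()` plays the role that `np.ascontiguousarray` plays above. The behaviour shown is what I would expect on CPython.

```python
import hashlib

data = bytes(range(256)) * 4

# Slicing a memoryview with a step gives a strided view, not a copy.
strided_view = memoryview(data)[::16]
is_contiguous = strided_view.contiguous  # False for a step-16 view

try:
    hashlib.sha1(strided_view)
    rejected = False
except (BufferError, ValueError):
    # hashlib asks for a simple contiguous buffer and the view cannot provide one.
    rejected = True

# Copying into a packed bytes object makes the data contiguous again.
packed = strided_view.tobytes()
digest = hashlib.sha1(packed).hexdigest()
```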
Anyway, here are the timing results:
In [27]: %timeit hashlib.sha1(img.tobytes())
10 loops, best of 3: 36.3 ms per loop
In [28]: %timeit px =np.ascontiguousarray(np.asarray(img).ravel()[::16]); hashlib.sha1(px)
100 loops, best of 3: 16.9 ms per loop
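So it turns out the numpy array version is about twice as fast. However, it only uses 1/16 of the data. I'm not sure what you're using the hash for, but I'd suggest just hashing the whole image for the extra 20 or so milliseconds.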