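I am trying to write a Python program that reads a random subset of TIFF images from a large collection of TIFF images. Interestingly, I found that when the indices used to build the list of image paths come from a random generator, Python reads the TIFF images (floating-point values) much more slowly than when equivalent random indices are hard-coded.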
import datetime
import matplotlib.pyplot as plt
import numpy
def read_in_seq(image_filenames, indices):
    return [plt.imread(image_filenames[index]) for index in indices]
image_filenames = []
for index in range(15000):
    image_filenames.append("/tmp/%05d" % index + ".tiff")
# Generated once with numpy.random.choice(15000, 100), then hard-coded here
indices=[
3885, 901, 6233, 7234, 10195, 2204, 469, 2906, 12114, 13515, 12977, 5201,
8829, 11537, 5400, 9633, 10744, 12991, 2593, 3046, 5103, 1901, 8831, 12454,
9779, 4714, 10839, 8702, 8537, 2136, 5095, 9006, 13293, 9933, 3584, 10818,
8594, 11032, 3705, 435, 6679, 8349, 6930, 9741, 12933, 3231, 1849, 7871,
11752, 8361, 3094, 2229, 14303, 2006, 5554, 1492, 14817, 12690, 10648, 14631,
6401, 6181, 4401, 7222, 9881, 8381, 7603, 11374, 12702, 6881, 11868, 10967,
14508, 12930, 3542, 1197, 8387, 11253, 1802, 14732, 7419, 11994, 6083, 8846,
5370, 4276, 13953, 14409, 8197, 8956, 4717, 3262, 2314, 12527, 5394, 12495,
6708, 9724, 740, 10416]
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + ": Normal input read started with size=" + str(len(indices)))
output = read_in_seq(image_filenames, indices) # takes 0.8 seconds
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + ": Normal input read completed with size=" + str(len(output)))
indices = numpy.random.choice(15000, 100)
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + ": Random input read started with size=" + str(len(indices)))
output = read_in_seq(image_filenames, indices) # takes ~3 seconds
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + ": Random input read completed with size=" + str(len(output)))
Here is the output:
2018-01-10 15:30:46.170487: Normal input read started with size=100
2018-01-10 15:30:46.943557: Normal input read completed with size=100
2018-01-10 15:30:46.943718: Random input read started with size=100
2018-01-10 15:30:49.858074: Random input read completed with size=100
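All 15000 TIFF images are identical, about 3 MB each. As you can see, reading 100 of the 15000 TIFF images with the hard-coded random indices (the "normal input read") takes only about 0.8 seconds. However, when the indices are produced by a random generator (e.g. numpy.random), the same read takes nearly 3 seconds.
On the other hand, if we modify the code above to read 100 PNG images out of 15000, the time with hard-coded random indices is almost the same as with indices generated by numpy.random (about 4 seconds; roughly 3.5 s versus 4.7 s in the run below).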
for index in range(15000):
    image_filenames.append("/tmp/%05d" % index + ".png")
----
2018-01-10 16:20:30.498341: Normal input read started with size=100
2018-01-10 16:20:34.020450: Normal input read completed with size=100
2018-01-10 16:20:34.020602: Random input read started with size=100
2018-01-10 16:20:38.692906: Random input read completed with size=100
Note that the timings for reading the TIFF images do not include the time spent in numpy.random; only the image reads inside read_in_seq are timed.
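For reference, the read could also be timed in isolation with a small helper instead of comparing printed timestamps. This is only a sketch: `timed` is a hypothetical name that does not appear in the code above, and it assumes Python 3 for `time.perf_counter`.

```python
import time

def timed(label, func, *args):
    """Call func(*args), print the elapsed wall-clock time, return (result, seconds).

    Hypothetical helper for illustration; only the call to func is timed,
    so any work done before the call (such as generating indices) is excluded.
    """
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print("%s: %.3f s" % (label, elapsed))
    return result, elapsed
```

With the code above this would be called as `timed("Random input read", read_in_seq, image_filenames, indices)`, so the index generation stays outside the measured interval.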
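Assuming we can only use a single thread, can someone explain why Python reads TIFF images more slowly when the image paths are selected with randomly generated indices than with hard-coded random indices? For example, is it related to CPU floating-point support, hard-disk seeks, OS design, or something else?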
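One experiment that might help narrow this down is to draw fresh random indices, convert them to plain Python ints, and time the same read twice. This is a sketch and not a diagnosis: `first_vs_repeat` and `time_read` are hypothetical names, and the standard-library `random.choices` (Python 3.6+) stands in for `numpy.random.choice`; both sample with replacement by default.

```python
import random
import time

def time_read(read_fn, filenames, indices):
    """Return the wall-clock seconds spent in read_fn(filenames, indices)."""
    start = time.perf_counter()
    read_fn(filenames, indices)
    return time.perf_counter() - start

def first_vs_repeat(read_fn, filenames, count, seed=None):
    """Time one read over fresh random indices, then the identical read again.

    The indices are plain Python ints, so the index type is the same as in the
    hard-coded case; only "seen before" versus "not seen before" differs.
    """
    rng = random.Random(seed)
    indices = rng.choices(range(len(filenames)), k=count)
    first = time_read(read_fn, filenames, indices)
    repeat = time_read(read_fn, filenames, indices)
    return indices, first, repeat
```

With the code from the question this would be called as `first_vs_repeat(read_in_seq, image_filenames, 100)`. If the repeat turns out much faster than the first pass, that would point toward something outside Python (for example files already held in the OS file cache from earlier runs with the same hard-coded indices) rather than toward how the indices were generated. That interpretation is an assumption to be checked, not something the measurements above establish.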