Question

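I am trying to write a Python program that randomly reads some TIFF images from a large sample of TIFF images. Interestingly, I found that if we use a random generator to produce the indices and build the list of image paths, Python tends to read the TIFF images (floating-point values) much more slowly than when the image paths come from hard-coded random indices.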

import datetime
import matplotlib.pyplot as plt
import numpy

def read_in_seq(image_filenames, indices):
    return [ plt.imread(image_filenames[index]) for index in indices ]

image_filenames = []

for index in range(15000):
    image_filenames.append("/tmp/%05d" % index + ".tiff")

# Generated once with numpy.random.choice(15000, 100), then hard-coded here
indices=[
  3885,   901,  6233,  7234, 10195,  2204,   469,  2906, 12114, 13515, 12977, 5201,
  8829, 11537,  5400,  9633, 10744, 12991,  2593,  3046,  5103,  1901,  8831, 12454,
  9779,  4714, 10839,  8702,  8537,  2136,  5095,  9006, 13293,  9933,  3584, 10818,
  8594, 11032,  3705,   435,  6679,  8349,  6930,  9741, 12933,  3231,  1849,  7871,
 11752,  8361,  3094,  2229, 14303,  2006,  5554,  1492, 14817, 12690, 10648, 14631,
  6401,  6181,  4401,  7222,  9881,  8381,  7603, 11374, 12702,  6881, 11868, 10967,
 14508, 12930,  3542,  1197,  8387, 11253,  1802, 14732,  7419, 11994,  6083,  8846,
  5370,  4276, 13953, 14409,  8197,  8956,  4717,  3262,  2314, 12527,  5394, 12495,
  6708,  9724,   740, 10416]

print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + ": Normal input read started with size=" + str(len(indices)))
output = read_in_seq(image_filenames, indices) # takes 0.8 seconds
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + ": Normal input read completed with size=" + str(len(output)))

indices = numpy.random.choice(15000, 100)
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + ": Random input read started with size=" + str(len(indices)))
output = read_in_seq(image_filenames, indices) # takes ~3 seconds
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + ": Random input read completed with size=" + str(len(output)))

Here is the output:

2018-01-10 15:30:46.170487: Normal input read started with size=100
2018-01-10 15:30:46.943557: Normal input read completed with size=100
2018-01-10 15:30:46.943718: Random input read started with size=100
2018-01-10 15:30:49.858074: Random input read completed with size=100

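All 15000 TIFF images are the same, at ~3 MB each. As you can see, reading 100 of the 15000 TIFF images with the hard-coded random indices takes only about 0.8 seconds. However, when the indices are produced by a random generator (e.g. numpy.random), it takes nearly 3 seconds.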

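On the other hand, if we modify the code above to read 100 PNG images out of 15000, the time to read the PNG images with the hard-coded random indices is almost the same as with the indices generated by numpy.random (about 4 seconds).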

for index in range(15000):
    image_filenames.append("/tmp/%05d" % index + ".png")
----
2018-01-10 16:20:30.498341: Normal input read started with size=100
2018-01-10 16:20:34.020450: Normal input read completed with size=100
2018-01-10 16:20:34.020602: Random input read started with size=100
2018-01-10 16:20:38.692906: Random input read completed with size=100

Note that the timings for reading the TIFF images do not include the time spent in numpy.random (only the time spent reading images in read_in_seq is counted).
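To make that measurement boundary explicit, the read step can be timed on its own with a monotonic clock instead of printed timestamps. This is only a sketch: `time_reads` and `read_one` are hypothetical names that do not appear in the code above, and `read_one` stands in for `plt.imread`.

```python
import time

def time_reads(read_one, paths, indices):
    """Time only the reads; return (elapsed_seconds, results)."""
    start = time.perf_counter()
    # int(i) turns numpy integers into plain Python ints before indexing the list
    results = [read_one(paths[int(i)]) for i in indices]
    elapsed = time.perf_counter() - start
    return elapsed, results
```

For the script above, the call would look like `time_reads(plt.imread, image_filenames, indices)`, with the indices generated before the call so that their generation is excluded from the timing.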

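Assuming we can only use a single thread, can someone explain why Python reads TIFF images more slowly when the image paths are retrieved using a random generator (compared with retrieving them using hard-coded random indices)? For example, is it related to CPU floating-point support, hard disk seeks, OS design, or something else?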
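One way to narrow down these candidates might be to read the same freshly generated indices twice and compare the two passes. The sketch below assumes nothing about the cause. `compare_passes` and `read_one` are hypothetical names introduced for illustration, and the indices are converted to plain Python ints so that the index type (numpy.random.choice returns a NumPy integer array) cannot be a factor.

```python
import time

def compare_passes(read_one, paths, indices):
    """Read the same files twice; return [first_pass_seconds, second_pass_seconds]."""
    # Plain Python ints, to rule out any effect of NumPy integer indices
    indices = [int(i) for i in indices]
    timings = []
    for _ in range(2):
        start = time.perf_counter()
        for i in indices:
            read_one(paths[i])
        timings.append(time.perf_counter() - start)
    return timings
```

With `read_one=plt.imread` and `indices=numpy.random.choice(15000, 100)`, a second pass that is much faster than the first would suggest that whether the files had been read recently matters more than how the indices were produced. Similar timings in both passes would point elsewhere.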

Why does Python read TIFF images more slowly when the image paths are retrieved using a random generator?

0 answers: