I have several HDF5 files that I want to read in Python with a generator (using h5py). The problem is that reading from the HDF5 files is very fast at first (about 0.08 s per call to data_generator), but after roughly 10 calls each call to data_generator suddenly takes about 5 s, i.e. ~100x slower. The threshold is not fixed, either: sometimes it happens after 11 calls, sometimes after 9.
The code looks like this:
def data_generator(start, end):
    batch_size = 32  # hard-coded for demo purposes
    size = 1000      # hard-coded for demo purposes
    for i in range(start, end):
        for j in range(int(size / batch_size)):  # for all examples
            print("i is", i, "j is", j)
            if (j + 1) * batch_size <= size:
                f1 = h5py.File(os.path.join(processed_data_folder, "features_" + str(i) + ".hdf5"), 'r')
                X = f1["features"][j * batch_size:(j + 1) * batch_size]
                f2 = h5py.File(os.path.join(processed_data_folder, "labels_" + str(i) + ".hdf5"), 'r')
                y = f2["labels"][j * batch_size:(j + 1) * batch_size]
                yield X, y
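For reference, this is how I measure the per-call time. Below is a self-contained sketch of the measurement: it writes dummy feature/label files to a temp directory (with images shrunk to 8x8 and a smaller size so it runs anywhere) and times each generator step; the real run uses the full-size files above.

```python
import os
import tempfile
import time

import h5py
import numpy as np

# Dummy data standing in for the real files (reduced shapes for the demo)
processed_data_folder = tempfile.mkdtemp()
size, batch_size = 64, 32  # reduced from the real 1000 / 32

with h5py.File(os.path.join(processed_data_folder, "features_0.hdf5"), "w") as hf:
    hf.create_dataset("features", data=np.zeros((size, 8, 8, 1), dtype=np.float32))
with h5py.File(os.path.join(processed_data_folder, "labels_0.hdf5"), "w") as hf:
    hf.create_dataset("labels", data=np.zeros(size, dtype=np.float32))

def data_generator(start, end):
    # Same open-a-file-per-batch structure as the generator in the question
    for i in range(start, end):
        for j in range(size // batch_size):
            f1 = h5py.File(os.path.join(processed_data_folder, "features_" + str(i) + ".hdf5"), "r")
            X = f1["features"][j * batch_size:(j + 1) * batch_size]
            f2 = h5py.File(os.path.join(processed_data_folder, "labels_" + str(i) + ".hdf5"), "r")
            y = f2["labels"][j * batch_size:(j + 1) * batch_size]
            yield X, y

gen = data_generator(0, 1)
timings = []
while True:
    t0 = time.perf_counter()
    batch = next(gen, None)  # time exactly one generator step
    if batch is None:
        break
    timings.append(time.perf_counter() - t0)

print(len(timings))  # one timing entry per batch
```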
When I create the HDF5 files, I use code like this:
def parse_data_frame_hdf5_without_scaling(df, fold_num):
    labels = []
    labels_file = os.path.join(processed_data_folder, "labels_" + str(fold_num) + '.hdf5')
    for index, row in df.iterrows():
        try:
            im = imread(os.path.join(data_folder, row['Image Index']))
        except IOError:
            print("can't find file or read data " + row['Image Index'])
        else:
            im = resize(im, output_shape=(resized_height, resized_width), mode='constant', preserve_range=True)
            im = im.T.flatten()[:, np.newaxis].T
            # print("shape of im is", im.shape, "type is", type(im))
            create_or_append_hdf5(im, fold_num)
            labels.append(row['isPneumonia'])
    with h5py.File(labels_file, 'w') as hf:
        hf.create_dataset("labels", data=labels)
    print("labels shape is", len(labels), "fold", fold_num)
create_or_append_hdf5 looks like this:
def create_or_append_hdf5(img, fold_num):
    img = np.asarray(img).reshape(len(img), resized_height, resized_width, 1)
    feature_file = os.path.join(processed_data_folder, "features_" + str(fold_num) + '.hdf5')
    if os.path.exists(feature_file):
        with h5py.File(feature_file, 'a') as hf:
            hf["features"].resize(hf["features"].shape[0] + img.shape[0], axis=0)
            hf["features"][-img.shape[0]:] = img
    else:
        with h5py.File(feature_file, 'w') as hf:
            hf.create_dataset("features", data=img, maxshape=(None, resized_height, resized_width, 1))
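The resize-then-assign append above can be exercised in isolation. Here is a minimal, self-contained sketch of the same pattern, with dummy 4x4 "images" in a temp directory instead of the real resized_height x resized_width ones:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "features_demo.hdf5")

def append_block(img):
    # img: (n, 4, 4, 1) float32 block to append along axis 0
    if os.path.exists(path):
        with h5py.File(path, "a") as hf:
            ds = hf["features"]
            ds.resize(ds.shape[0] + img.shape[0], axis=0)  # grow axis 0
            ds[-img.shape[0]:] = img                        # fill the new rows
    else:
        with h5py.File(path, "w") as hf:
            # maxshape=(None, ...) makes axis 0 resizable; HDF5 then stores
            # the dataset chunked, which is what allows resize() later
            hf.create_dataset("features", data=img, maxshape=(None, 4, 4, 1))

append_block(np.ones((2, 4, 4, 1), dtype=np.float32))
append_block(np.full((3, 4, 4, 1), 2.0, dtype=np.float32))

with h5py.File(path, "r") as hf:
    shape = hf["features"].shape

print(shape)  # (5, 4, 4, 1)
```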
At first I thought this was a disk-throttling problem (I'm running on an Azure VM), but I checked the IOPS and disk read throughput and both stayed within 30% of the VM's allowed limits. I then suspected some performance problem in how I created the HDF5 files, but if that were the case, the first few reads would not have been fast either.
I also wondered whether the data inside the HDF5 files varied a lot, which could make disk read times vary a lot, but I inspected the data and found nothing unusual. For what it's worth, all the data are float32 numpy arrays of shape (32, 512, 512, 1) (this is for image processing).
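While ruling out data-dependent read times, it's also easy to check with h5py how a dataset is actually laid out on disk (dtype, chunk shape, compression), since resizable datasets (maxshape containing None) are always stored chunked and the chunk shape controls how much data each slice read touches. A self-contained sketch against a dummy file:

```python
import os
import tempfile

import h5py
import numpy as np

# Dummy resizable dataset standing in for the real features file
path = os.path.join(tempfile.mkdtemp(), "inspect_demo.hdf5")
with h5py.File(path, "w") as hf:
    hf.create_dataset("features",
                      data=np.zeros((4, 16, 16, 1), dtype=np.float32),
                      maxshape=(None, 16, 16, 1))

with h5py.File(path, "r") as hf:
    ds = hf["features"]
    # .chunks is the on-disk chunk shape (auto-chosen here); .compression
    # is None when no filter was requested at creation time
    layout = (str(ds.dtype), ds.shape, ds.chunks, ds.compression)

print(layout)
```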
I'm really struggling with this, since it badly degrades performance, and I can't figure out the cause. Any pointers would be greatly appreciated!