I have several HDF5 files that I want to read in Python with a generator (using h5py). The problem is that reading from the HDF5 files is very fast at first (about 0.08 s per call to data_generator), but after roughly 10 calls each call to data_generator suddenly takes about 5 s, i.e. ~100x slower. The threshold is not fixed, either: sometimes it happens after 11 calls, sometimes after 9.
The code looks like this:
def data_generator(start, end):
    batch_size = 32  # hard-coded for demo purposes
    size = 1000      # hard-coded for demo purposes
    for i in range(start, end):
        for j in range(int(size / batch_size)):  # for all examples
            print("i is", i, "j is", j)
            if (j + 1) * batch_size <= size:
                f1 = h5py.File(os.path.join(processed_data_folder, "features_" + str(i) + ".hdf5"), 'r')
                X = f1["features"][j * batch_size:(j + 1) * batch_size]
                f2 = h5py.File(os.path.join(processed_data_folder, "labels_" + str(i) + ".hdf5"), 'r')
                y = f2["labels"][j * batch_size:(j + 1) * batch_size]
                yield X, y
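For reference, this is how I measure the per-call time. Below is a self-contained sketch of the measurement: it writes dummy feature/label files to a temp directory (with images shrunk to 8x8 and a smaller size so it runs anywhere) and times each generator step; the real run uses the full-size files above.

```python
import os
import tempfile
import time

import h5py
import numpy as np

# Dummy data standing in for the real files (reduced shapes for the demo)
processed_data_folder = tempfile.mkdtemp()
size, batch_size = 64, 32  # reduced from the real 1000 / 32

with h5py.File(os.path.join(processed_data_folder, "features_0.hdf5"), "w") as hf:
    hf.create_dataset("features", data=np.zeros((size, 8, 8, 1), dtype=np.float32))
with h5py.File(os.path.join(processed_data_folder, "labels_0.hdf5"), "w") as hf:
    hf.create_dataset("labels", data=np.zeros(size, dtype=np.float32))

def data_generator(start, end):
    # Same open-a-file-per-batch structure as the generator in the question
    for i in range(start, end):
        for j in range(size // batch_size):
            f1 = h5py.File(os.path.join(processed_data_folder, "features_" + str(i) + ".hdf5"), "r")
            X = f1["features"][j * batch_size:(j + 1) * batch_size]
            f2 = h5py.File(os.path.join(processed_data_folder, "labels_" + str(i) + ".hdf5"), "r")
            y = f2["labels"][j * batch_size:(j + 1) * batch_size]
            yield X, y

gen = data_generator(0, 1)
timings = []
while True:
    t0 = time.perf_counter()
    batch = next(gen, None)  # time exactly one generator step
    if batch is None:
        break
    timings.append(time.perf_counter() - t0)

print(len(timings))  # one timing entry per batch
```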
When I create the HDF5 files, I use code like this:
def parse_data_frame_hdf5_without_scaling(df, fold_num):
    labels = []
    labels_file = os.path.join(processed_data_folder, "labels_" + str(fold_num) + '.hdf5')
    for index, row in df.iterrows():
        try:
            im = imread(os.path.join(data_folder, row['Image Index']))
        except IOError:
            print("can't find file or read data " + row['Image Index'])
        else:
            im = resize(im, output_shape=(resized_height, resized_width), mode='constant', preserve_range=True)
            im = im.T.flatten()[:, np.newaxis].T
            # print("shape of im is", im.shape, "type is", type(im))
            create_or_append_hdf5(im, fold_num)
            labels.append(row['isPneumonia'])
    with h5py.File(labels_file, 'w') as hf:
        hf.create_dataset("labels", data=labels)
    print("labels shape is", len(labels), "fold", fold_num)
create_or_append_hdf5 looks like this:
def create_or_append_hdf5(img, fold_num):
    img = np.asarray(img).reshape(len(img), resized_height, resized_width, 1)
    feature_file = os.path.join(processed_data_folder, "features_" + str(fold_num) + '.hdf5')
    if os.path.exists(feature_file):
        with h5py.File(feature_file, 'a') as hf:
            hf["features"].resize(hf["features"].shape[0] + img.shape[0], axis=0)
            hf["features"][-img.shape[0]:] = img
    else:
        with h5py.File(feature_file, 'w') as hf:
            hf.create_dataset("features", data=img, maxshape=(None, resized_height, resized_width, 1))
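The resize-then-assign append above can be exercised in isolation. Here is a minimal, self-contained sketch of the same pattern, with dummy 4x4 "images" in a temp directory instead of the real resized_height x resized_width ones:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "features_demo.hdf5")

def append_block(img):
    # img: (n, 4, 4, 1) float32 block to append along axis 0
    if os.path.exists(path):
        with h5py.File(path, "a") as hf:
            ds = hf["features"]
            ds.resize(ds.shape[0] + img.shape[0], axis=0)  # grow axis 0
            ds[-img.shape[0]:] = img                        # fill the new rows
    else:
        with h5py.File(path, "w") as hf:
            # maxshape=(None, ...) makes axis 0 resizable; HDF5 then stores
            # the dataset chunked, which is what allows resize() later
            hf.create_dataset("features", data=img, maxshape=(None, 4, 4, 1))

append_block(np.ones((2, 4, 4, 1), dtype=np.float32))
append_block(np.full((3, 4, 4, 1), 2.0, dtype=np.float32))

with h5py.File(path, "r") as hf:
    shape = hf["features"].shape

print(shape)  # (5, 4, 4, 1)
```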
At first I thought this was a disk-throttling problem (I'm running on an Azure VM), but I checked the IOPS and disk read throughput and both stayed within 30% of the VM's allowed limits. I then suspected some performance problem in how I created the HDF5 files, but if that were the case, the first few reads would not have been fast either.
I also wondered whether the data inside the HDF5 files varied a lot, which could make disk read times vary a lot, but I inspected the data and found nothing unusual. For what it's worth, all the data are float32 numpy arrays of shape (32, 512, 512, 1) (this is for image processing).
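While ruling out data-dependent read times, it's also easy to check with h5py how a dataset is actually laid out on disk (dtype, chunk shape, compression), since resizable datasets (maxshape containing None) are always stored chunked and the chunk shape controls how much data each slice read touches. A self-contained sketch against a dummy file:

```python
import os
import tempfile

import h5py
import numpy as np

# Dummy resizable dataset standing in for the real features file
path = os.path.join(tempfile.mkdtemp(), "inspect_demo.hdf5")
with h5py.File(path, "w") as hf:
    hf.create_dataset("features",
                      data=np.zeros((4, 16, 16, 1), dtype=np.float32),
                      maxshape=(None, 16, 16, 1))

with h5py.File(path, "r") as hf:
    ds = hf["features"]
    # .chunks is the on-disk chunk shape (auto-chosen here); .compression
    # is None when no filter was requested at creation time
    layout = (str(ds.dtype), ds.shape, ds.chunks, ds.compression)

print(layout)
```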
I'm really struggling with this, since it badly degrades performance, and I can't figure out the cause. Any pointers would be greatly appreciated!