Reading HDF5 files is fast at first, then becomes very slow

Asked: 2017-12-03 23:24:46

Tags: python performance hdf5 h5py

I have several HDF5 files and want to read them in Python with a generator (using h5py). The problem is that reads are very fast at first (0.08 seconds per data_generator call), but after roughly 10 calls each call to data_generator takes about 5 seconds (100x slower). The threshold isn't fixed at 10, either; sometimes it is 11, sometimes 9.

The code looks like this:

import os
import h5py

def data_generator(start, end):
    batch_size = 32  # hard-coded for demo purposes
    size = 1000      # hard-coded for demo purposes
    for i in range(start, end):  # one pair of feature/label files per fold i
        for j in range(int(size / batch_size)):  # for all batches in the fold
            print("i is", i, "j is", j)
            if (j + 1) * batch_size <= size:
                # note: both files are reopened on every batch and never explicitly closed
                f1 = h5py.File(os.path.join(processed_data_folder, "features_" + str(i) + ".hdf5"), 'r')
                X = f1["features"][j * batch_size:(j + 1) * batch_size]
                f2 = h5py.File(os.path.join(processed_data_folder, "labels_" + str(i) + ".hdf5"), 'r')
                y = f2["labels"][j * batch_size:(j + 1) * batch_size]
                yield X, y
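
For reference, the per-call timings quoted above can be reproduced with a small harness like the following. This is a minimal sketch; it assumes data_generator and processed_data_folder are defined as above and that fold 0 exists.

import time

gen = data_generator(0, 1)
while True:
    t0 = time.perf_counter()
    try:
        X, y = next(gen)  # one batch read from the HDF5 files
    except StopIteration:
        break
    print("batch took %.3f s, X shape %s" % (time.perf_counter() - t0, X.shape))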

When I create the HDF5 files, I use code like the following:

import numpy as np
# imread/resize are assumed to come from scikit-image, which matches the
# keyword arguments used below
from skimage.io import imread
from skimage.transform import resize

def parse_data_frame_hdf5_without_scaling(df, fold_num):
    labels = []
    labels_file = os.path.join(processed_data_folder, "labels_" + str(fold_num) + '.hdf5')
    for index, row in df.iterrows():
        try:
            im = imread(os.path.join(data_folder, row['Image Index']))
        except IOError:
            print("can't find file or read data: " + row['Image Index'])
        else:
            im = resize(im, output_shape=(resized_height, resized_width), mode='constant', preserve_range=True)
            # flatten the (H, W) image into a single (1, H*W) row
            im = im.T.flatten()[:, np.newaxis].T
            create_or_append_hdf5(im, fold_num)
            labels.append(row['isPneumonia'])

    # labels for the whole fold are written in one shot at the end
    with h5py.File(labels_file, 'w') as hf:
        hf.create_dataset("labels", data=labels)
    print("labels shape is", len(labels), "fold", fold_num)

create_or_append_hdf5 looks like this:

def create_or_append_hdf5(img, fold_num):
    # img arrives as shape (1, resized_height * resized_width); restore image dims
    img = np.asarray(img).reshape(len(img), resized_height, resized_width, 1)
    feature_file = os.path.join(processed_data_folder, "features_" + str(fold_num) + '.hdf5')
    if os.path.exists(feature_file):
        # grow the existing dataset along axis 0 and write the new image at the end
        with h5py.File(feature_file, 'a') as hf:
            hf["features"].resize(hf["features"].shape[0] + img.shape[0], axis=0)
            hf["features"][-img.shape[0]:] = img
    else:
        # first image: create a dataset that is resizable along axis 0
        with h5py.File(feature_file, 'w') as hf:
            hf.create_dataset("features", data=img, maxshape=(None, resized_height, resized_width, 1))

At first I thought this was a disk throttling issue (since I'm running on an Azure VM), but I checked the IOPS and disk-read throughput, and both stayed within 30% of what the Azure VM allows. Then I suspected a performance problem in how my HDF5 files were created, but if that were the case, the first few reads wouldn't be fast either.

I also wondered whether the data in the HDF5 files varies a lot, so that disk read times would vary widely, but I inspected the data and found nothing unusual. For what it's worth, all the data are float32 numpy arrays of shape (32, 512, 512, 1) (image batches).
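
For scale, simple arithmetic on the figures quoted above gives the effective read rates involved (the much smaller label reads are ignored):

# one yielded batch of features: 32 x 512 x 512 x 1 float32 values
batch_bytes = 32 * 512 * 512 * 1 * 4   # = 33,554,432 bytes = 32 MiB
print(batch_bytes / 2**20 / 0.08)      # ~400 MiB/s while calls take 0.08 s
print(batch_bytes / 2**20 / 5.0)       # ~6.4 MiB/s once calls take 5 s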

I'm really struggling with this because the slowdown kills performance, and I can't figure out the cause. Any pointers would be greatly appreciated!

0 Answers