Using TFRecord but the file is too large

Time: 2019-06-03 12:16:42

Tags: tensorflow

I am trying to create a TFRecord file from a folder of numpy arrays; the folder contains about 2000 .npy files, each around 50 MB.

def convert(image_paths,out_path):
    # Args:
    # image_paths   List of file-paths for the .npy image arrays.
    # out_path      File-path for the TFRecords output file.    
    print("Converting: " + out_path)
    # Number of images. Used when printing the progress.
    num_images = len(image_paths)    
    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:        
        # Iterate over all the image-paths and class-labels.
        for i, path in enumerate(image_paths):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images-1)
            # Load the image array from the .npy file with numpy.
            img = np.load(path)
            # Convert the image to raw bytes (tostring() is a
            # deprecated alias of tobytes()).
            img_bytes = img.tobytes()
            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {
                'image': wrap_bytes(img_bytes)
            }
            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)
            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)
            # Serialize the data.
            serialized = example.SerializeToString()        
            # Write the serialized data to the TFRecords file.
            writer.write(serialized)

It converts about 200 files, and then I get this:

Converting: tf.recordtrain
- Progress: 3.6%Traceback (most recent call last):
  File "tf_record.py", line 71, in <module>
out_path=path_tfrecords_train)
  File "tf_record.py", line 54, in convert
writer.write(serialized)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/tf_record.py", line 236, in write
self._writer.WriteRecord(record, status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: tf.recordtrain; File too large

Any suggestions for solving this problem would be helpful, thanks.

1 answer:

Answer 0: (score: 0)

I am not sure what the size limit on a TFRecord file is, but assuming you have enough disk space, the more common approach is to shard the dataset across multiple TFRecord files anyway, e.g. writing every 20 numpy files into a separate TFRecord file.
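A minimal sketch of that sharding approach, assuming the question's `convert(image_paths, out_path)` function is passed in as `convert_fn`; the shard size of 20 and the `train-XXXXX-of-XXXXX` naming pattern are just illustrative choices, not requirements:

```python
import os

def shard_paths(image_paths, shard_size=20):
    # Split the full list of .npy paths into chunks of at most shard_size.
    return [image_paths[i:i + shard_size]
            for i in range(0, len(image_paths), shard_size)]

def convert_sharded(image_paths, out_dir, convert_fn, shard_size=20):
    # convert_fn is expected to have the same signature as the question's
    # convert(image_paths, out_path) and write one TFRecord file per call.
    shards = shard_paths(image_paths, shard_size)
    for shard_idx, shard in enumerate(shards):
        # Conventional shard naming, e.g. train-00003-of-00100.tfrecords
        out_path = os.path.join(
            out_dir,
            "train-%05d-of-%05d.tfrecords" % (shard_idx, len(shards)))
        convert_fn(shard, out_path)
```

With 2000 files and a shard size of 20, this yields 100 TFRecord files of roughly 1 GB each; at read time the whole list of shard paths can then be passed to `tf.data.TFRecordDataset`.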