Question

Before marking this as a duplicate, please read the post:

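I have been looking for an efficient way to count the number of examples in a TFRecord file of images. Since a TFRecord file does not store any metadata about the file itself, the user has to iterate through the file to compute this.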

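There are a few different questions on StackOverflow that answer this. The problem is that they all seem to use the DEPRECATED tf.python_io.tf_record_iterator command, so this is not a stable solution. Here are examples of the existing posts: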

Obtaining total number of records from .tfrecords file in Tensorflow

Number of examples in each tfrecord

So I was wondering whether there is a way to count the number of records using the new Dataset API.

Answer 1

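I got the following code working without the deprecated command. Hopefully this will help others.

Set up the dataset with the Dataset API, then loop over it. I don't know if this is the fastest way, but it works.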

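import tensorflow as tf

# _parse_image_function is my own parsing function (defined elsewhere);
# the map step is not required just to count records.
count_test = tf.data.TFRecordDataset('testing.tfrecord')
count_test = count_test.map(_parse_image_function)
count_test = count_test.repeat(1)
count_test = count_test.batch(1)

# With eager execution the dataset can be iterated directly;
# make_one_shot_iterator() is itself deprecated.
c = 0
for ex in count_test:
    c += 1
print(f"There are {c} testing records")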

This seems to work fine even on relatively large files.

Answer 2

There is a reduce method listed under the Dataset class. The documentation gives an example of counting records with it:

import numpy as np
import tensorflow as tf

# generate the dataset (batch size and repeat must be 1, maybe avoid dataset manipulation like map and shard)
ds = tf.data.Dataset.range(5)
# count the examples by reduce
cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)

print(cnt.numpy())  # produces 5

I don't know whether this method is faster than @krishnab's for loop.
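
The documentation example above counts a range dataset, not a TFRecord file. Here is a minimal sketch of the same reduce idea applied to a TFRecordDataset, assuming TensorFlow 2.x with eager execution. The helper name count_records and the throwaway demo.tfrecord file are made up for illustration and are not from the answer.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def count_records(path: str) -> int:
    """Count the raw records in one TFRecord file without parsing them."""
    ds = tf.data.TFRecordDataset(path)
    # reduce keeps the counting loop inside tf.data instead of Python
    cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)
    return int(cnt.numpy())


# Write a small throwaway file so the sketch is self-contained.
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(tmp_path) as writer:
    for i in range(7):
        writer.write(f"record-{i}".encode("utf-8"))

n_records = count_records(tmp_path)
print(n_records)
```

No map, batch, or repeat is applied here, so each dataset element corresponds to exactly one serialized record.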

Answer 3

The following worked for me with TensorFlow 2.1 (using the code from this answer):

import os

import tensorflow as tf


def count_tfrecord_examples(
        tfrecords_dir: str,
) -> int:
    """
    Counts the total number of examples in a collection of TFRecord files.

    :param tfrecords_dir: directory that is assumed to contain only TFRecord files
    :return: the total number of examples in the collection of TFRecord files
        found in the specified directory
    """

    count = 0
    for file_name in os.listdir(tfrecords_dir):
        tfrecord_path = os.path.join(tfrecords_dir, file_name)
        count += sum(1 for _ in tf.data.TFRecordDataset(tfrecord_path))

    return count
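
The function above assumes the directory contains nothing but TFRecord files. If that might not hold, one possible variation is to select files by a glob pattern and hand the whole list to a single TFRecordDataset, which accepts a list of file names. This is only a sketch, assuming TensorFlow 2.x with eager execution; the name count_examples_matching, the *.tfrecord suffix, and the demo files are hypothetical.

```python
import glob
import os
import tempfile

import tensorflow as tf


def count_examples_matching(pattern: str) -> int:
    """Count records across all files matching a glob pattern."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        return 0  # nothing matched; skip building a dataset from an empty list
    # TFRecordDataset accepts a list of file names and reads them in sequence
    dataset = tf.data.TFRecordDataset(paths)
    return sum(1 for _ in dataset)


# Throwaway directory with two TFRecord shards and one unrelated file.
demo_dir = tempfile.mkdtemp()
for shard, n in enumerate([3, 4]):
    shard_path = os.path.join(demo_dir, f"part-{shard}.tfrecord")
    with tf.io.TFRecordWriter(shard_path) as writer:
        for i in range(n):
            writer.write(f"shard{shard}-record{i}".encode("utf-8"))
with open(os.path.join(demo_dir, "notes.txt"), "w") as f:
    f.write("not a TFRecord file\n")

total = count_examples_matching(os.path.join(demo_dir, "*.tfrecord"))
print(total)
```

The unrelated notes.txt file is never opened as a TFRecord, because it does not match the pattern.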

Tensorflow: counting the number of examples in a TFRecord file, without using the deprecated `tf.python_io.tf_record_iterator`
