Unexpected results when using tfrecords loaded via tf.data.Dataset.list_files() with the shuffle argument

Asked: 2019-04-22 00:46:54

Tags: python tensorflow tfrecord

I am hoping to clarify how the shuffle argument of tf.data.Dataset.list_files() works. The documentation states that when shuffle=True, the filenames will be shuffled randomly. I have run model predictions on a tfrecords dataset loaded with tf.data.Dataset.list_files(), and I expected the accuracy metric to be the same regardless of the order of the files (i.e., whether shuffle is True or False), but that is not what I am seeing.

Is this expected behavior, or is something wrong with my code or my interpretation? I have reproducible example code below.

Oddly enough, as long as tf.random.set_random_seed() has been set initially (and it does not even seem to matter which seed value is used), the predictions are identical regardless of whether shuffle is True or False in list_files().
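To illustrate the seed behavior with plain Python (a sketch only; tf.data shuffles with its own internal RNG, so random.Random here is just an assumed stand-in for whatever list_files() does internally):

```python
import random

filenames = ['iris{}.tfrecord'.format(i) for i in range(10)]

def shuffled(names, seed):
    # A fresh RNG per call simulates a fresh run; the same seed
    # always yields the same file order across runs.
    rng = random.Random(seed)
    out = list(names)
    rng.shuffle(out)
    return out

# Same seed across "runs" -> identical order; the set of files is unchanged.
assert shuffled(filenames, 7) == shuffled(filenames, 7)
assert sorted(shuffled(filenames, 7)) == sorted(filenames)
```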

tensorflow == 1.13.1,keras == 2.2.4

Thank you for any clarification!

EDIT: Rethinking this, I am wondering whether this is a separate and independent call from list_files()?

Y = [y[0] for _ in range(steps) for y in sess.run(Y)]
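If sess.run(Y) really is an independent pass over the data, one hypothetical consequence can be sketched in plain Python (this is only a guess at the mechanism, not something confirmed by the tf.data documentation): labels read in a pass ordered differently from the pass that produced the predictions no longer line up row for row.

```python
def accuracy(pred_order, label_order):
    # Fraction of positions where the two passes agree.
    return sum(p == l for p, l in zip(pred_order, label_order)) / len(pred_order)

ordered = list(range(10))            # order seen by the prediction pass
rotated = ordered[1:] + ordered[:1]  # stand-in for an independently shuffled pass

assert accuracy(ordered, ordered) == 1.0  # aligned passes agree perfectly
assert accuracy(ordered, rotated) == 0.0  # misaligned passes do not
```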

I split the dataset into multiple tfrecords files so that it can be reloaded later with list_files(). First, fit and save a dummy model:

# Fit and Save a Dummy Model
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn import datasets, metrics

seed = 7
np.random.seed(seed)
tf.random.set_random_seed(seed)

dataset = datasets.load_iris()

X = dataset.data
Y = dataset.target
dummy_Y = np_utils.to_categorical(Y)

# 150 rows
print(len(X))

model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, dummy_Y, epochs=10, batch_size=10,  verbose=2)
model.save('./iris/iris_model')

predictions = model.predict(X)
predictions = np.argmax(predictions, axis=1)

# returns accuracy = 0.3466666666666667
print(metrics.accuracy_score(y_true=Y, y_pred=predictions))

Next, write the 150 rows out as ten tfrecord files of 15 rows each:

numrows = 15
for i, j in enumerate(range(0, len(X), numrows)):
    with tf.python_io.TFRecordWriter('./iris/iris{}.tfrecord'.format(i)) as writer:
        for x, y in zip(X[j:j+numrows, ], Y[j:j+numrows, ]):
            features = tf.train.Features(feature=
                {'X': tf.train.Feature(float_list=tf.train.FloatList(value=x)), 
                'Y': tf.train.Feature(int64_list=tf.train.Int64List(value=[y]))
                })
            example = tf.train.Example(features=features)
            writer.write(example.SerializeToString())
At this point, I quit (ipython), restart, and reload the model along with the tfrecord files:

import numpy as np
import tensorflow as tf
from keras.models import load_model
from sklearn import metrics

model = load_model('./iris/iris_model')

batch_size = 10
steps = int(150/batch_size)
file_pattern = './iris/iris*.tfrecord'

feature_description = {
    'X': tf.FixedLenFeature([4], tf.float32),
    'Y': tf.FixedLenFeature([1], tf.int64)
}

def _parse_function(example_proto):
    return tf.parse_single_example(example_proto, feature_description)

def load_data(filenames, batch_size):
    raw_dataset = tf.data.TFRecordDataset(filenames)
    dataset = raw_dataset.map(_parse_function)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.prefetch(2)
    iterator = dataset.make_one_shot_iterator()
    record = iterator.get_next()
    return record['X'], record['Y']

def get_predictions_accuracy(filenames):
    X, Y = load_data(filenames=filenames, batch_size=batch_size)

    predictions = model.predict([X], steps=steps)
    predictions = np.argmax(predictions, axis=1)
    print(len(predictions))

    with tf.Session() as sess:
        Y = [y[0] for _ in range(steps) for y in sess.run(Y)]

    print(metrics.accuracy_score(y_true=Y, y_pred=predictions))
# No shuffle results:
# Returns expected accuracy = 0.3466666666666667
filenames_noshuffle = tf.data.Dataset.list_files(file_pattern=file_pattern, shuffle=False)
get_predictions_accuracy(filenames_noshuffle)
# Shuffle results, no seed value set:
# Returns UNEXPECTED accuracy (non-deterministic value)
filenames_shuffle_noseed = tf.data.Dataset.list_files(file_pattern=file_pattern, shuffle=True)
get_predictions_accuracy(filenames_shuffle_noseed)
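As an aside, the nested comprehension inside get_predictions_accuracy re-evaluates sess.run(Y) once per outer iteration, pulling a fresh batch each time and flattening the y values; the same pattern can be sketched with a plain generator standing in for the iterator:

```python
batches = iter([[[0], [1]], [[2], [0]]])  # two batches, each of shape (2, 1)

def run_Y():
    # Stand-in for sess.run(Y): each call returns the next batch of labels.
    return next(batches)

steps = 2
Y = [y[0] for _ in range(steps) for y in run_Y()]
assert Y == [0, 1, 2, 0]  # run_Y() was called once per step, results flattened
```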

0 Answers

No answers yet.