Is it possible to do this elegantly?
Right now the only way I can think of is to save the SparseTensor's indices (tf.int64), values (tf.float32) and shape (tf.int64) in 3 separate features (the first two as VarLenFeature and the last one as FixedLenFeature). This seems really cumbersome.
Any suggestions are appreciated!
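For concreteness, here is a minimal sketch of that three-feature encoding (the feature names, the float32 values and the rank-2 assumption are mine, not part of any existing API):

import numpy as np
import tensorflow as tf

def sparse_to_example(indices, values, shape):
    # indices: [N, 2] int64 array, values: [N] float32 array, shape: [2] int64 array
    return tf.train.Example(features=tf.train.Features(feature={
        'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=indices.flatten())),
        'values': tf.train.Feature(float_list=tf.train.FloatList(value=values)),
        'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=shape))
    }))

def parse_sparse(in_example_proto):
    parsed = tf.parse_single_example(in_example_proto, {
        'indices': tf.VarLenFeature(tf.int64),
        'values': tf.VarLenFeature(tf.float32),
        'shape': tf.FixedLenFeature([2], dtype=tf.int64)
    })
    # VarLenFeature itself parses into a SparseTensor, so densify before reassembling
    indices = tf.reshape(tf.sparse_tensor_to_dense(parsed['indices']), [-1, 2])
    values = tf.sparse_tensor_to_dense(parsed['values'])
    return tf.SparseTensor(indices=indices, values=values, dense_shape=parsed['shape'])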
The answer below is not suitable for building a computation graph (because the contents of the SparseTensor have to be extracted via sess.run(), which costs a lot of time if called repeatedly).
Inspired by mrry's answer, I thought we could obtain the bytes generated by tf.serialize_sparse so that later we could recover the SparseTensor with tf.deserialize_many_sparse. However, tf.serialize_sparse is not implemented in pure Python (it calls the external function SerializeSparse), which means we would still need sess.run() to get the bytes. How can I get a pure Python version of SerializeSparse? Thanks.
Answer 0 (score: 1)
Since TensorFlow currently supports only 3 types in a tfrecord (Float, Int64 and Bytes), while a SparseTensor usually involves more than 1 type, my solution is to convert the SparseTensor to bytes with Pickle.
Here is the sample code:
import tensorflow as tf
import pickle
import numpy as np
from scipy.sparse import csr_matrix
#---------------------------------#
# Write to a tfrecord file
# create two sparse matrices (simulate the values from .eval() of SparseTensor)
a = csr_matrix(np.arange(12).reshape((4,3)))
b = csr_matrix(np.random.rand(20).reshape((5,4)))
# convert them to pickle bytes
p_a = pickle.dumps(a)
p_b = pickle.dumps(b)
# put the bytes in context_list and feature_list
## save p_a in context_lists
context_lists = tf.train.Features(feature={
    'context_a': tf.train.Feature(bytes_list=tf.train.BytesList(value=[p_a]))
})
## save p_b as a one element sequence in feature_lists
p_b_features = [tf.train.Feature(bytes_list=tf.train.BytesList(value=[p_b]))]
feature_lists = tf.train.FeatureLists(feature_list={
    'features_b': tf.train.FeatureList(feature=p_b_features)
})
# create the SequenceExample
SeqEx = tf.train.SequenceExample(
    context=context_lists,
    feature_lists=feature_lists
)
SeqEx_serialized = SeqEx.SerializeToString()
# write to a tfrecord file
tf_FWN = 'test_pickle1.tfrecord'
tf_writer1 = tf.python_io.TFRecordWriter(tf_FWN)
tf_writer1.write(SeqEx_serialized)
tf_writer1.close()
#---------------------------------#
# Read from the tfrecord file
# first, define the parse function
def _parse_SE_test_pickle1(in_example_proto):
    context_features = {
        'context_a': tf.FixedLenFeature([], dtype=tf.string)
    }
    sequence_features = {
        'features_b': tf.FixedLenSequenceFeature([1], dtype=tf.string)
    }
    context, sequence = tf.parse_single_sequence_example(
        in_example_proto,
        context_features=context_features,
        sequence_features=sequence_features
    )
    p_a_tf = context['context_a']
    p_b_tf = sequence['features_b']
    return tf.tuple([p_a_tf, p_b_tf])
# use the Dataset API to read
dataset = tf.data.TFRecordDataset(tf_FWN)
dataset = dataset.map(_parse_SE_test_pickle1)
dataset = dataset.batch(1)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(iterator.initializer)
[p_a_bat, p_b_bat] = sess.run(next_element)
# the 1st index refers to the batch; the 2nd and 3rd indices refer to the sequence position (only for b)
rec_a = pickle.loads(p_a_bat[0])
rec_b = pickle.loads(p_b_bat[0][0][0])
# check whether the recovered the same as the original ones.
assert((rec_a - a).nnz == 0)
assert((rec_b - b).nnz == 0)
# print the contents
print("\n------ a -------")
print(a.todense())
print("\n------ rec_a -------")
print(rec_a.todense())
print("\n------ b -------")
print(b.todense())
print("\n------ rec_b -------")
print(rec_b.todense())
This is what I got:
------ a -------
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

------ rec_a -------
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

------ b -------
[[ 0.88612402  0.51438017  0.20077887  0.20969243]
 [ 0.41762425  0.47394715  0.35596051  0.96074408]
 [ 0.35491739  0.0761953   0.86217511  0.45796474]
 [ 0.81253723  0.57032448  0.94959189  0.10139615]
 [ 0.92177499  0.83519464  0.96679833  0.41397829]]

------ rec_b -------
[[ 0.88612402  0.51438017  0.20077887  0.20969243]
 [ 0.41762425  0.47394715  0.35596051  0.96074408]
 [ 0.35491739  0.0761953   0.86217511  0.45796474]
 [ 0.81253723  0.57032448  0.94959189  0.10139615]
 [ 0.92177499  0.83519464  0.96679833  0.41397829]]
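Note that pickle.loads above runs on the outputs of sess.run(), i.e. outside the graph. If the unpickling should instead happen inside the input pipeline, a rough sketch with tf.py_func could look like the following (the helper name _unpickle_to_dense is mine, and it must return a dense array because tf.py_func cannot return a scipy sparse matrix):

def _unpickle_to_dense(p_bytes):
    # plain Python executed inside the pipeline; densify the csr_matrix first
    return np.asarray(pickle.loads(p_bytes).todense(), dtype=np.float64)

# e.g. inside _parse_SE_test_pickle1, instead of returning the raw bytes:
# dense_a = tf.py_func(_unpickle_to_dense, [p_a_tf], tf.float64)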
Answer 1 (score: 1)
I ran into the problem of reading and writing sparse tensors to TFRecord files, and I found very little information about it online.
One solution, as you suggested, is to store the indices, values and shape of the SparseTensor in 3 separate features, which is discussed here. This seems neither efficient nor elegant.
I have a working example (with TensorFlow 2.0.0.alpha0).
Maybe not the most elegant, but it seems to work.
import tensorflow as tf
import numpy as np
# Example data
st_1 = tf.SparseTensor(indices=[[0,0],[1,2]], values=[1,2], dense_shape=[3,4])
st_2 = tf.SparseTensor(indices=[[0,1],[2,0],[3,3]], values=[3,9,5], dense_shape=[4, 4])
sparse_tensors = [st_1, st_2]
# Serialize sparse tensors to an array of byte strings
serialized_sparse_tensors = [tf.io.serialize_sparse(st).numpy() for st in sparse_tensors]
# Write to TFRecord
with tf.io.TFRecordWriter('sparse_example.tfrecord') as tfwriter:
    for sst in serialized_sparse_tensors:
        sparse_example = tf.train.Example(features=
            tf.train.Features(feature=
                {'sparse_tensor':
                    tf.train.Feature(bytes_list=tf.train.BytesList(value=sst))
                }))
        # Append each example into tfrecord
        tfwriter.write(sparse_example.SerializeToString())
def parse_fn(data_element):
    features = {'sparse_tensor': tf.io.FixedLenFeature([3], tf.string)}
    parsed = tf.io.parse_single_example(data_element, features=features)
    # tf.io.deserialize_many_sparse() requires the dimensions to be [N, 3], so we add one dimension with expand_dims
    parsed['sparse_tensor'] = tf.expand_dims(parsed['sparse_tensor'], axis=0)
    # deserialize the sparse tensor
    parsed['sparse_tensor'] = tf.io.deserialize_many_sparse(parsed['sparse_tensor'], dtype=tf.int32)
    # convert from sparse to dense
    parsed['sparse_tensor'] = tf.sparse.to_dense(parsed['sparse_tensor'])
    # remove the extra batch dimension: [1, rows, cols] -> [rows, cols]
    parsed['sparse_tensor'] = tf.squeeze(parsed['sparse_tensor'])
    return parsed
# Read from TFRecord
dataset = tf.data.TFRecordDataset(['sparse_example.tfrecord'])
dataset = dataset.map(parse_fn)
# Pad and batch dataset
dataset = dataset.padded_batch(2, padded_shapes={'sparse_tensor':[None,None]})
dataset.__iter__().get_next()
This will output:
{'sparse_tensor': <tf.Tensor: id=295, shape=(2, 4, 4), dtype=int32, numpy=
array([[[1, 0, 0, 0],
        [0, 0, 2, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[0, 3, 0, 0],
        [0, 0, 0, 0],
        [9, 0, 0, 0],
        [0, 0, 0, 5]]], dtype=int32)>}
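As a usage note, in TF 2.x the same dataset can also be consumed with a plain Python loop in eager mode instead of the explicit get_next() call, e.g.:

# each element is a dict holding the densified, padded batch
for batch in dataset:
    print(batch['sparse_tensor'].shape)  # (2, 4, 4) for the example above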