I am building a TensorFlow model where the input data is a large scipy sparse matrix; each row is a sample of dimension >50k in which only a few hundred values are non-zero.
Currently, I store this matrix as a pickle, load it fully into memory, batch it, and convert the samples in each batch to a dense numpy array that I feed into the model. This works fine as long as the whole dataset fits into memory, but the approach does not scale when I want to use much more data.
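For reference, the current pipeline looks roughly like this (simplified; the file name, batch_size and the model call are placeholders):

import pickle

import numpy as np

# Load the whole scipy.sparse matrix into memory (the part that
# stops scaling), then densify one batch at a time.
with open("data.pkl", "rb") as f:
    sparse_matrix = pickle.load(f)

batch_size = 128
for start in range(0, sparse_matrix.shape[0], batch_size):
    batch = sparse_matrix[start:start + batch_size]
    dense_batch = batch.toarray().astype(np.float32)
    # feed dense_batch (and the matching labels) into the model here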
I have investigated TFRecords as a way to serialize my data and read it more efficiently with TensorFlow, but I can't find any example that uses sparse data.
I found an example for MNIST:
import tensorflow as tf

writer = tf.python_io.TFRecordWriter("mnist.tfrecords")
# construct the Example proto object
example = tf.train.Example(
    # Example contains a Features proto object
    features=tf.train.Features(
        # Features contains a map of string to Feature proto objects
        feature={
            # A Feature contains one of either a int64_list,
            # float_list, or bytes_list
            'label': tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
            'image': tf.train.Feature(
                int64_list=tf.train.Int64List(value=features.astype("int64"))),
        }))
# use the proto object to serialize the example to a string
serialized = example.SerializeToString()
# write the serialized object to disk
writer.write(serialized)
writer.close()
where label is an int and features is a np.array of length 784 representing each pixel of the image as a float. I understand this approach, but I can't really reproduce it, since converting each row of my sparse matrix to a dense np.array would be intractable as well.
I think I need to create a key for each feature (column) and store only the non-zero ones for each example, roughly as sketched below, but I'm not sure it is possible to specify a default value (0 in my case) for "missing" features.
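To make the idea concrete, this is roughly what I imagine for the writing side. It is only a sketch: sparse_matrix (a scipy CSR matrix), labels and the output file name are placeholders for my actual data.

import tensorflow as tf

# Store, for each row, only the column indices and the values of its
# non-zero entries instead of the full >50k-dimensional dense vector.
writer = tf.python_io.TFRecordWriter("sparse_data.tfrecords")
for i in range(sparse_matrix.shape[0]):
    row = sparse_matrix.getrow(i)  # 1 x n_features CSR row
    example = tf.train.Example(features=tf.train.Features(feature={
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[labels[i]])),
        # column indices of the non-zero entries of this row
        'indices': tf.train.Feature(
            int64_list=tf.train.Int64List(value=row.indices.astype("int64"))),
        # the corresponding non-zero values
        'values': tf.train.Feature(
            float_list=tf.train.FloatList(value=row.data.astype("float32"))),
    }))
    writer.write(example.SerializeToString())
writer.close()

The part I don't know is the reading side: how to parse these variable-length features and get back a dense row where every column I did not store defaults to 0.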
What would be the best way to do that?