I am building a TensorFlow model where the input data is a large scipy sparse matrix; each row is a sample of dimension >50k in which only a few hundred values are non-zero.
Currently, I store this matrix as a pickle, load it fully into memory, batch it, and convert the samples in each batch to a dense numpy array that I feed into the model. This works fine as long as the whole dataset fits into memory, but the approach does not scale when I want to use much more data.
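For reference, the current pipeline looks roughly like this (simplified; the file name, batch_size and the model call are placeholders):

import pickle

import numpy as np

# Load the whole scipy.sparse matrix into memory (the part that
# stops scaling), then densify one batch at a time.
with open("data.pkl", "rb") as f:
    sparse_matrix = pickle.load(f)

batch_size = 128
for start in range(0, sparse_matrix.shape[0], batch_size):
    batch = sparse_matrix[start:start + batch_size]
    dense_batch = batch.toarray().astype(np.float32)
    # feed dense_batch (and the matching labels) into the model here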
I have investigated TFRecords as a way to serialize my data and read it more efficiently with TensorFlow, but I can't find any example that uses sparse data.
I found an example for MNIST:
import tensorflow as tf

writer = tf.python_io.TFRecordWriter("mnist.tfrecords")
# construct the Example proto object
example = tf.train.Example(
    # Example contains a Features proto object
    features=tf.train.Features(
        # Features contains a map of string to Feature proto objects
        feature={
            # A Feature contains one of either a int64_list,
            # float_list, or bytes_list
            'label': tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
            'image': tf.train.Feature(
                int64_list=tf.train.Int64List(value=features.astype("int64"))),
        }))
# use the proto object to serialize the example to a string
serialized = example.SerializeToString()
# write the serialized object to disk
writer.write(serialized)
writer.close()
where label is an int and features is a np.array of length 784 representing each pixel of the image as a float. I understand this approach, but I can't really reproduce it, since converting each row of my sparse matrix to a dense np.array would be intractable as well.
I think I need to create a key for each feature (column) and store only the non-zero ones for each example, roughly as sketched below, but I'm not sure it is possible to specify a default value (0 in my case) for "missing" features.
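To make the idea concrete, this is roughly what I imagine for the writing side. It is only a sketch: sparse_matrix (a scipy CSR matrix), labels and the output file name are placeholders for my actual data.

import tensorflow as tf

# Store, for each row, only the column indices and the values of its
# non-zero entries instead of the full >50k-dimensional dense vector.
writer = tf.python_io.TFRecordWriter("sparse_data.tfrecords")
for i in range(sparse_matrix.shape[0]):
    row = sparse_matrix.getrow(i)  # 1 x n_features CSR row
    example = tf.train.Example(features=tf.train.Features(feature={
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[labels[i]])),
        # column indices of the non-zero entries of this row
        'indices': tf.train.Feature(
            int64_list=tf.train.Int64List(value=row.indices.astype("int64"))),
        # the corresponding non-zero values
        'values': tf.train.Feature(
            float_list=tf.train.FloatList(value=row.data.astype("float32"))),
    }))
    writer.write(example.SerializeToString())
writer.close()

The part I don't know is the reading side: how to parse these variable-length features and get back a dense row where every column I did not store defaults to 0.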
What would be the best way to do that?