How do I feed a sparse DataFrame to TensorFlow?

Asked: 2018-07-06 09:17:10

Tags: python pandas tensorflow sparse-matrix

I have a bag-of-words (BOW) dataset with roughly 30000x6000 features.

I am trying to reduce the file size, because the file is currently larger than 1 GB.
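For scale, here is a rough back-of-the-envelope estimate of the dense size. It assumes the data is a 30000 x 6000 matrix stored as 64-bit floats; the storage format is not stated here, so both the layout and `bytes_per_value` are assumptions.

```python
# Rough size of the dense matrix, under the stated assumptions.
n_rows, n_cols = 30000, 6000   # approximate shape given in the question
bytes_per_value = 8            # assumption: float64 values

dense_bytes = n_rows * n_cols * bytes_per_value
dense_gb = dense_bytes / 1e9   # decimal gigabytes

print(f"{dense_gb:.2f} GB")    # 1.44 GB
```

Under those assumptions the dense form alone is already above 1 GB, which is consistent with wanting a sparse representation.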

I am following the iris dataset example provided by TensorFlow, but I switched from a dense DataFrame to a sparse one.

I got the data from scikit-learn's TfidfVectorizer, which returns a csr_matrix, and then converted that to a sparse DataFrame.
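For context, a csr_matrix keeps only its explicitly stored (normally non-zero) entries. The same information can be written as coordinate (COO) triplets: indices, values, and a dense shape, which is the layout that `tf.SparseTensor(indices, values, dense_shape)` takes. Below is a minimal pure-Python sketch of that layout on a toy 3x3 matrix. The helper name `to_coo` is hypothetical and the toy values are only illustrative; with SciPy, `csr_matrix.tocoo()` exposes the same pieces as `.row`, `.col`, and `.data`.

```python
def to_coo(dense_rows):
    """Return (indices, values, dense_shape) for the non-zero entries.

    Hypothetical helper for illustration only; it walks a list of rows
    and records the coordinates and value of every non-zero cell.
    """
    indices, values = [], []
    for i, row in enumerate(dense_rows):
        for j, value in enumerate(row):
            if value != 0.0:
                indices.append([i, j])
                values.append(value)
    n_cols = len(dense_rows[0]) if dense_rows else 0
    return indices, values, [len(dense_rows), n_cols]


# Toy matrix (illustrative values, not the real dataset).
toy = [
    [0.0, 0.05881, 0.0],
    [0.1907, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

indices, values, dense_shape = to_coo(toy)
print(indices)      # [[0, 1], [1, 0]]
print(values)       # [0.05881, 0.1907]
print(dense_shape)  # [3, 3]
```

Only two of the nine cells are stored, which is why this layout is attractive for TF-IDF data that is mostly zeros.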

The input function:

import tensorflow as tf


def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    #print(features)
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    #print('this is..', dataset)
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
    #print(dataset)
    # Return the dataset.
    return dataset

The sparse DataFrame looks like this:

        0        1    2    3    4    5  ...  994  995  996  997  998  999
0  0.0000  0.05881  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
1  0.1907  0.00000  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
2  0.0000  0.00000  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
3  0.0000  0.00000  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
4  0.0000  0.00000  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0

The labels are a pandas Series like [0, 1, 0, 0, 0, ...] (binary classification).

If I fill all the NaNs with 0 and use the result as a dense dataset, it works fine. If I leave the NaNs in place, however, TensorFlow gives me this error:

Traceback (most recent call last):
  File "/Users/william/PycharmProjects/mn-classification/estimator-02/Inputs.py", line 182, in <module>
    input_fn=(lambda:idk.train_input_fn(X_train, y_train, 200)), steps=5000)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 853, in _train_model_default
    input_fn, model_fn_lib.ModeKeys.TRAIN))
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 691, in _get_features_and_labels_from_input_fn
    result = self._call_input_fn(input_fn, mode)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 798, in _call_input_fn
    return input_fn(**kwargs)
  File "/Users/william/PycharmProjects/mn-classification/estimator-02/Inputs.py", line 182, in <lambda>
    input_fn=(lambda:idk.train_input_fn(X_train, y_train, 200)), steps=5000)
  File "/Users/william/PycharmProjects/mn-classification/estimator-02/idk.py", line 9, in train_input_fn
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 235, in from_tensor_slices
    return TensorSliceDataset(tensors)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1036, in __init__
    batch_dim.assert_is_compatible_with(t.get_shape()[0])
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 116, in assert_is_compatible_with
    other))
ValueError: Dimensions 456 and 304 are not compatible

Is there any way to get TensorFlow to accept this?
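One possible direction, sketched under assumptions rather than as a confirmed fix: keep the data as the csr_matrix and densify only one batch at a time, so the full dense matrix never has to exist in memory. The generator name `dense_batches` is hypothetical and the toy arrays are only illustrative. A generator like this could in principle be wrapped with `tf.data.Dataset.from_generator`, but that wiring is not shown or tested here.

```python
import numpy as np
from scipy.sparse import csr_matrix


def dense_batches(X, y, batch_size):
    """Yield (dense_features, labels) pairs, one batch at a time.

    Hypothetical helper: X is a scipy csr_matrix, y a 1-D array of labels.
    Only the rows of the current batch are converted to a dense array.
    """
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        yield X[start:stop].toarray(), y[start:stop]


# Toy data (illustrative values, not the real dataset).
X = csr_matrix(np.array([
    [0.0, 0.05881, 0.0],
    [0.1907, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]))
y = np.array([0, 1, 0])

batches = list(dense_batches(X, y, batch_size=2))
for features, labels in batches:
    print(features.shape, labels)
```

The trade-off is that each batch is still dense once it reaches the model, so this saves memory and file size but not per-batch compute.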

Thanks for your help!

0 answers:

No answers yet