I have a bag-of-words (BOW) dataset with roughly 30000 x 6000 features.
I'm trying to reduce the file size, since the file is currently larger than 1 GB.
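For a sense of scale, here is a rough back-of-the-envelope estimate of a dense versus a CSR footprint. Only the 30000 x 6000 shape comes from the numbers above; the 8-byte float64 values, 4-byte indices, and 1% density are assumptions for illustration, not measurements of the real data or file.

```python
# Rough size estimate for a 30000 x 6000 matrix.
# Assumptions (not measured): float64 values, int32 CSR indices, 1% non-zeros.
ROWS, COLS = 30000, 6000
VALUE_BYTES = 8          # assumed float64
INDEX_BYTES = 4          # assumed int32 indices
ASSUMED_DENSITY = 0.01   # hypothetical fraction of non-zero entries


def dense_bytes(rows, cols, value_bytes=VALUE_BYTES):
    """Bytes needed to store every cell explicitly."""
    return rows * cols * value_bytes


def csr_bytes(rows, cols, density,
              value_bytes=VALUE_BYTES, index_bytes=INDEX_BYTES):
    """Bytes for CSR storage: values + column indices + row pointers."""
    nnz = round(rows * cols * density)
    return nnz * value_bytes + nnz * index_bytes + (rows + 1) * index_bytes


dense = dense_bytes(ROWS, COLS)
sparse = csr_bytes(ROWS, COLS, ASSUMED_DENSITY)
print(f"dense: {dense / 1e9:.2f} GB")
print(f"CSR:   {sparse / 1e6:.1f} MB")
```

The dense figure lands above 1 GB under these assumptions, while the sparse figure depends entirely on the assumed density.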
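I followed the Iris dataset example from TensorFlow, but switched from a dense to a sparse dataframe.
I get the data from scikit-learn's TfidfVectorizer, which returns a csr_matrix, and then convert that into a sparse dataframe.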
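A minimal sketch of that pipeline, for reference. The toy corpus is hypothetical, and `pd.DataFrame.sparse.from_spmatrix` (pandas 0.25 or later) is swapped in here as one way to do the conversion; it is an assumption, not necessarily the call used in the original code. Note that it treats missing entries as 0 rather than NaN.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real documents.
docs = ["the cat sat", "the dog barked", "a cat and a dog"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # SciPy sparse matrix in CSR format

# Swapped-in conversion (assumption): build a DataFrame whose columns
# use pandas' sparse dtype. Unstored entries are treated as 0, not NaN.
sparse_df = pd.DataFrame.sparse.from_spmatrix(X)

print(X.format, X.shape)
print(sparse_df.dtypes.iloc[0])
print(sparse_df.sparse.density)
```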
Input function:

import tensorflow as tf

def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    #print(features)
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    #print('this is..', dataset)
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
    #print(dataset)
    # Return the dataset.
    return dataset
The sparse dataframe looks like this:

        0        1    2    3    4    5  ...  994  995  996  997  998  999
0  0.0000  0.05881  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
1  0.1907  0.00000  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
2  0.0000  0.00000  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
3  0.0000  0.00000  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
4  0.0000  0.00000  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
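The labels are a pandas Series such as [0,1,0,0,0 ...] (binary classification).
If I fill all the NaN values with 0 and use the result as a dense dataset, it works fine. However, if I leave the NaN values in place, TensorFlow gives me this error: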
Traceback (most recent call last):
  File "/Users/william/PycharmProjects/mn-classification/estimator-02/Inputs.py", line 182, in <module>
    input_fn=(lambda:idk.train_input_fn(X_train, y_train, 200)), steps=5000)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 853, in _train_model_default
    input_fn, model_fn_lib.ModeKeys.TRAIN))
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 691, in _get_features_and_labels_from_input_fn
    result = self._call_input_fn(input_fn, mode)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 798, in _call_input_fn
    return input_fn(**kwargs)
  File "/Users/william/PycharmProjects/mn-classification/estimator-02/Inputs.py", line 182, in <lambda>
    input_fn=(lambda:idk.train_input_fn(X_train, y_train, 200)), steps=5000)
  File "/Users/william/PycharmProjects/mn-classification/estimator-02/idk.py", line 9, in train_input_fn
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 235, in from_tensor_slices
    return TensorSliceDataset(tensors)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1036, in __init__
    batch_dim.assert_is_compatible_with(t.get_shape()[0])
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 116, in assert_is_compatible_with
    other))
ValueError: Dimensions 456 and 304 are not compatible
ValueError: Dimensions 456 and 304 are not compatible
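The last frames point at a shape check: `from_tensor_slices` slices every component along its first dimension, so all feature columns and the labels must have the same number of rows. One possible reading, which is a guess and not verified, is that each sparse column gets converted to a tensor holding only its stored values, which would produce columns of different lengths such as 456 and 304. A minimal sketch of a pre-check along those lines follows; `find_length_mismatches` is a hypothetical helper, and the toy lists stand in for the real columns.

```python
def find_length_mismatches(features, labels):
    """Return {column_name: length} for columns whose length differs from labels."""
    expected = len(labels)
    return {
        name: len(values)
        for name, values in features.items()
        if len(values) != expected
    }


# Toy data: column "1" deliberately has fewer entries than the labels,
# mimicking a column that kept only its stored (non-missing) values.
features = {
    "0": [0.0, 0.1907, 0.0],
    "1": [0.05881],
    "2": [0.0, 0.0, 0.0],
}
labels = [0, 1, 0]

mismatches = find_length_mismatches(features, labels)
print(mismatches)  # {'1': 1}
```

Running a check like this on `dict(features)` before building the dataset would show whether the columns really disagree in length.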
Is there any way to get TensorFlow to accept the sparse dataframe?
Thanks for your help!