Suppose I have the following pandas DataFrame:
  | col1 | col2 | col3      |
-----------------------------
0 | 5    | 4    | [0,2,4]   |
1 | 3    | 8    | [7,3]     |
2 | 2    | 1    | [7,3,6,9] |
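In 'col3' I have lists of different sizes, which I also want to feed into a TensorFlow Estimator. These lists also have to be k-hot encoded. I'm not sure that is the usual name for it, but this is what I mean: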
[1,4,6] ---> [0, 1, 0, 0, 1, 0, 1]
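To make the encoding I mean concrete, here is a minimal sketch in plain Python. The helper name `k_hot` and its `depth` parameter are made up for illustration and are not part of any library:

```python
def k_hot(indices, depth):
    """Return a list of length `depth` with 1 at every position in `indices`."""
    vector = [0] * depth          # start with all zeros
    for i in indices:
        vector[i] = 1             # switch on each listed position
    return vector


print(k_hot([1, 4, 6], depth=7))  # [0, 1, 0, 0, 1, 0, 1]
```

Every list ends up with the same length `depth`, no matter how many entries it had originally.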
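The problem is that the largest value in col3 is 600_000, so my k-hot encoded vector would have size 600_000. Because of that, I can't encode my whole DataFrame up front (I get a MemoryError) and then pass col3 to TensorFlow as: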
tf.feature_column.numeric_column('col3', 600_000)
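For a rough sense of why the dense encoding blows up, here is a back-of-the-envelope sketch. The row count of 100_000 and the 4-byte (float32) and 8-byte (int64) sizes are assumptions for illustration, not my real numbers, and both helper names are hypothetical:

```python
def dense_k_hot_bytes(n_rows, depth, bytes_per_value=4):
    """Memory needed if every row is stored as a full k-hot vector."""
    return n_rows * depth * bytes_per_value


def index_list_bytes(rows, bytes_per_index=8):
    """Memory needed if only the original indices of each row are stored."""
    return sum(len(row) for row in rows) * bytes_per_index


# Assumed 100_000 rows, each encoded as a 600_000-wide float32 vector.
print(dense_k_hot_bytes(100_000, 600_000) / 1e9)  # 240.0 (GB)

# The three example rows from the table, kept as plain index lists.
print(index_list_bytes([[0, 2, 4], [7, 3], [7, 3, 6, 9]]))  # 72 (bytes)
```

The dense size grows with the vector width, while the index lists only grow with how many values each row actually contains.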
Do you have any idea how I can feed this column into a DNNRegressor? To share some code, this is what I normally do for "standard" columns:
import tensorflow as tf

# reading the columns from the pandas_input_fn
col1 = tf.feature_column.numeric_column('col1', default_value=0.0)
col2 = tf.feature_column.numeric_column('col2', default_value=0.0)
# converting them to categorical
col1_b = tf.feature_column.bucketized_column(col1, [0, 5, 10, 20])
col2_b = tf.feature_column.bucketized_column(col2, [0, 4, 8, 16])
# make crosses
cross = tf.feature_column.crossed_column([col1_b, col2_b], 4*4)
# define what will go in my estimator
features = [
    tf.feature_column.embedding_column(cross, 10),
    tf.feature_column.indicator_column(col1_b),
    tf.feature_column.indicator_column(col2_b),
]
# and finally
estimator = tf.estimator.DNNRegressor(
    model_dir=model_dir,
    feature_columns=features,
    hidden_units=[512, 256, 256, 128, 32])