Question

Suppose I have the following pandas DataFrame:

   | col1 | col2  |     col3    |
---------------------------------
0  | 5    |  4    | [0,2,4]     |
1  | 3    |  8    | [7,3]       |
2  | 2    |  1    | [7,3,6,9]   |

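In 'col3' I have lists of different sizes, which I also want to feed into a TensorFlow Estimator. These lists also have to be k-hot encoded. I am not sure that is the usual name for it, but I mean this: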

[1,4,6] ---> [0, 1, 0, 0, 1, 0, 1]
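To make the encoding concrete, here is a minimal sketch in plain Python. The helper name `k_hot` is hypothetical (it is not part of TensorFlow or pandas), and it assumes the list entries are zero-based positions in the output vector.

```python
def k_hot(indices, size):
    """Return a list of `size` zeros with a 1 at every position in `indices`."""
    vector = [0] * size
    for i in indices:
        # Guard against indices that do not fit in the requested vector size.
        if not 0 <= i < size:
            raise ValueError(f"index {i} is outside [0, {size})")
        vector[i] = 1
    return vector


print(k_hot([1, 4, 6], 7))  # [0, 1, 0, 0, 1, 0, 1]
```

Every row of 'col3' would become one such vector, which is why the vector length is driven by the largest value in the column rather than by the length of each list.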

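Here is the problem: the maximum value in col3 is 600_000, so each k-hot encoded vector would have size 600_000. Because of that I cannot encode my whole DataFrame (I get a MemoryError) and pass col3 to TensorFlow with: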

tf.feature_column.numeric_column('col3', 600_000)
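A rough back-of-the-envelope calculation shows why the dense encoding runs out of memory. This is only a sketch: the question does not state the number of rows, so `hypothetical_rows` is an assumed value, and 4 bytes per entry assumes float32 storage.

```python
VECTOR_SIZE = 600_000      # length of one k-hot vector, as stated in the question
BYTES_PER_FLOAT32 = 4      # assumption: each entry is stored as a float32


def dense_k_hot_bytes(num_rows, vector_size=VECTOR_SIZE,
                      bytes_per_value=BYTES_PER_FLOAT32):
    """Bytes needed to hold num_rows dense k-hot vectors in memory."""
    return num_rows * vector_size * bytes_per_value


# Cost of a single encoded row, in megabytes.
print(dense_k_hot_bytes(1) / 1e6)                  # 2.4

# Assumed row count, for illustration only (not given in the question).
hypothetical_rows = 100_000
print(dense_k_hot_bytes(hypothetical_rows) / 1e9)  # 240.0
```

Under these assumptions each row costs about 2.4 MB, so even a modest DataFrame needs far more memory than the few integers per row that the original lists contain.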

Do you have any idea how to feed this column into a DNNRegressor? To share some code, this is what I usually do for "standard" columns:

import tensorflow as tf

# reading the columns from the pandas_input_fn
col1 = tf.feature_column.numeric_column('col1', default_value=0.0)
col2 = tf.feature_column.numeric_column('col2', default_value=0.0)

# converting them to categorical
col1_b = tf.feature_column.bucketized_column(col1, [0,5,10,20])
col2_b = tf.feature_column.bucketized_column(col2, [0,4,8,16])

# make crosses
cross = tf.feature_column.crossed_column([col1_b, col2_b], 4*4)

# define what will go in my estimator
features = [
   tf.feature_column.embedding_column(cross, 10),
   tf.feature_column.indicator_column(col1_b),
   tf.feature_column.indicator_column(col2_b),
]

# and finally (model_dir is assumed to be defined elsewhere)
estimator = tf.estimator.DNNRegressor(
              model_dir=model_dir, 
              feature_columns=features, 
              hidden_units=[512,256,256,128,32])
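For readers unfamiliar with bucketized columns, the sketch below mimics in plain Python which bucket each value from the example table would fall into. It assumes the bucket convention described in the TensorFlow documentation (left-closed, right-open intervals, with one extra bucket below the first boundary and one above the last). The helper `bucket_index` is hypothetical and not a TensorFlow function.

```python
from bisect import bisect_right


def bucket_index(value, boundaries):
    """Index of the left-closed, right-open bucket that contains `value`."""
    return bisect_right(boundaries, value)


COL1_BOUNDARIES = [0, 5, 10, 20]
COL2_BOUNDARIES = [0, 4, 8, 16]

# Values taken from the example table in the question.
col1_values = [5, 3, 2]
col2_values = [4, 8, 1]

col1_buckets = [bucket_index(v, COL1_BOUNDARIES) for v in col1_values]
col2_buckets = [bucket_index(v, COL2_BOUNDARIES) for v in col2_values]

print(col1_buckets)               # [2, 1, 1]
print(col2_buckets)               # [2, 3, 1]
print(len(COL1_BOUNDARIES) + 1)   # 5 buckets per column
```

Under that convention four boundaries give five buckets per column, so the two columns have 25 possible bucket pairs, while the cross above hashes them into 4*4 = 16 slots. That still works, since the cross is hashed, but some pairs may share a slot.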

k-hot encoding features TensorFlow Estimator

0 answers: