Crossing lists of strings in TensorFlow with tf.feature_column.crossed_column

Asked: 2018-04-19 10:26:15

Tags: tensorflow google-cloud-ml tensorflow-datasets tensorflow-estimator

I have two features, post_tag and user_tag. Each is a padded string of N and M words respectively.

For example, a post_tag might be "the-oscars brad-pitt xyzpadxyz xyzpadxyz xyzpadxyz", so the post is tagged as relating to the Oscars and Brad Pitt.

Then for user_tag we might have "brad-pitt universal sag the-academy-awards xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz".

Here user_tag is a sample of the tags from the 20 most recent posts the user consumed.

So post_tag is always 5 tags long and user_tag is always 10 tags long.
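To make the shapes concrete, a single raw example before any splitting is just the two space-separated strings above in a feature dict:

example = {
    'post_tag': 'the-oscars brad-pitt xyzpadxyz xyzpadxyz xyzpadxyz',
    'user_tag': ('brad-pitt universal sag the-academy-awards xyzpadxyz '
                 'xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz'),
}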

I split each string into a tensor as part of the Dataset API processing, like this:

features['post_tag'] = tf.string_split([features['post_tag']])
features['post_tag'] = tf.sparse_tensor_to_dense(features['post_tag'], default_value=PADWORD)
features['user_tag'] = tf.string_split([features['user_tag']])
features['user_tag'] = tf.sparse_tensor_to_dense(features['user_tag'], default_value=PADWORD)
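Run in isolation on one example string (a standalone sketch, assuming TF 1.x graph mode), the split step produces a [1, 5] dense string tensor:

import tensorflow as tf

PADWORD = 'xyzpadxyz'
raw = tf.constant('the-oscars brad-pitt xyzpadxyz xyzpadxyz xyzpadxyz')

split = tf.string_split([raw])                                   # SparseTensor with dense_shape [1, 5]
dense = tf.sparse_tensor_to_dense(split, default_value=PADWORD)  # dense string tensor, shape [1, 5]

with tf.Session() as sess:
    print(sess.run(dense))
    # [[b'the-oscars' b'brad-pitt' b'xyzpadxyz' b'xyzpadxyz' b'xyzpadxyz']]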

I have a vocabulary file of tags, and I feed each feature into the deep part of a wide-and-deep canned estimator like this:

user_tag = tf.feature_column.categorical_column_with_vocabulary_file(
            key='user_tag',
            vocabulary_file='{}/tagvocab.csv'.format(INPUT_DIR)
            )
user_tag_embed = tf.feature_column.embedding_column(
        categorical_column = user_tag ,
        dimension = USER_TAG_EMBEDDING_SIZE
        )
deep.append(user_tag_embed)

post_tag = tf.feature_column.categorical_column_with_vocabulary_file(
            key='post_tag',
            vocabulary_file='{}/tagvocab.csv'.format(INPUT_DIR)
            )
post_tag_embed = tf.feature_column.embedding_column(
        categorical_column = post_tag,
        dimension = POST_TAG_EMBEDDING_SIZE
        )
deep.append(post_tag_embed)
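For context, the deep (and wide) column lists then go into the canned wide-and-deep estimator, roughly like this (a sketch only: regressor vs. classifier, the hidden-unit sizes and model_dir are placeholders, not my actual config):

# Placeholders below (model_dir, hidden units); the real config is omitted here.
estimator = tf.estimator.DNNLinearCombinedRegressor(
        model_dir = 'model_dir',
        linear_feature_columns = wide,   # wide part, e.g. the crossed columns
        dnn_feature_columns = deep,      # deep part, the embedding columns above
        dnn_hidden_units = [64, 32]
        )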

However, what I really want to do is cross post_tag with user_tag. But if I do something like this:

user_post_tag_cross = tf.feature_column.crossed_column(
        keys = [user_tag, post_tag],
        hash_bucket_size = 25
        )
wide.append(user_post_tag_cross)

I get this error:

InvalidArgumentError (see above for traceback): Expected D2 of index to be 2 got 3 at position 0

I can see this comes from this line in the TF code.

I have a feeling that maybe crossing tensors like this just isn't possible, or that crossed_column only expects two plain strings. I have related features post_tag_first and user_tag_first, which are just a single random tag from each, passed in as a plain string. If I do this:

user_post_first_tag_cross = tf.feature_column.crossed_column(
            keys = ['user_tag_first', 'post_tag_first'],
            hash_bucket_size = 25
            )
wide.append(user_post_first_tag_cross)

it works and trains fine. So I really just want to understand what the best way is to cross the two tag tensors.

Does anyone have any ideas? I'd imagine this is something someone has dealt with before, since this kind of tag data is very common, as are analogous cases like search terms vs. document terms, keywords, etc.

Update - small reproducible example

Here is a small runnable example that sets up what I'm trying to do.

import tensorflow as tf
import numpy as np

# set up tag vocab
tag_vocab = ['brad-pitt','usa','donald-trump','tpb','tensorflow','xyzpadxyz','UNK']

# make data

# train data
user_tag_train = np.random.choice(tag_vocab,50).reshape((5, 10)) # 5 example rows with 10 tags per user
post_tag_train = np.random.choice(tag_vocab,25).reshape((5, 5)) # 5 example rows with 5 tags per post
x_train = np.array([1., 2., 3., 4., 5.]) # just another example numeric feature
y_train = np.array([0., -1., -2., -3., -4]) # outcome

# eval data
user_tag_eval = np.random.choice(tag_vocab,50).reshape((5, 10)) # 5 example rows with 10 tags per user
post_tag_eval = np.random.choice(tag_vocab,25).reshape((5, 5)) # 5 example rows with 5 tags per post
x_eval = np.array([2., 5., 8., 1., 5.])
y_eval = np.array([-1.01, -4.1, -7, 0., 9.])

# define feature cols

x_num = tf.feature_column.numeric_column("x", shape=[1])

user_tag_cat = tf.feature_column.categorical_column_with_vocabulary_list(
    key = 'user_tag',
    vocabulary_list = tag_vocab)

user_tag_embed = tf.feature_column.embedding_column(
    categorical_column = user_tag_cat,
    dimension = 3
)

post_tag_cat = tf.feature_column.categorical_column_with_vocabulary_list(
    key = 'post_tag',
    vocabulary_list = tag_vocab)

post_tag_embed = tf.feature_column.embedding_column(
    categorical_column = post_tag_cat,
    dimension = 2
)

user_post_tag_cross = tf.feature_column.crossed_column(
    keys = [ user_tag_cat, post_tag_cat ],
    hash_bucket_size = 5
)

#user_post_tag_embed_cross = tf.feature_column.crossed_column(
#    keys = [ user_tag_embed, post_tag_embed ],
#    hash_bucket_size = 5
#)

feature_columns = [
  x_num,
  user_tag_cat,
  post_tag_cat,
  user_tag_embed,
  post_tag_embed,
  user_post_tag_cross,
  #user_post_tag_embed_cross,
  ]

estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

input_fn = tf.estimator.inputs.numpy_input_fn(
    {
      "x": x_train,
      "user_tag": user_tag_train,
      "post_tag": post_tag_train,
    }, 
    y_train, 
    batch_size=5, 
    num_epochs=None, 
    shuffle=True
    )

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {
      "x": x_train,
      "user_tag": user_tag_train,
      "post_tag": post_tag_train,
      }, 
    y_train, 
    batch_size=5, 
    num_epochs=1000, 
    shuffle=False
    )

eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    {
      "x": x_eval,
      "user_tag": user_tag_eval,
      "post_tag": post_tag_eval
      }, 
    y_eval, 
    batch_size=5, 
    num_epochs=1000, 
    shuffle=False
    )

estimator.train(input_fn=input_fn, steps=1000)

train_metrics = estimator.evaluate(input_fn=train_input_fn)

eval_metrics = estimator.evaluate(input_fn=eval_input_fn)

print("\n\ntrain metrics: %r"% train_metrics)
print("eval metrics: %r"% eval_metrics)

The above code runs, but if you uncomment the parts relating to user_post_tag_embed_cross, you get (as expected):

ValueError: Unsupported key type. All keys must be either string, or categorical column except _HashedCategoricalColumn. Given: _EmbeddingColumn(categorical_column=_VocabularyListCategoricalColumn(key='user_tag', vocabulary_list=('brad-pitt', 'usa', 'donald-trump', 'tpb', 'tensorflow', 'xyzpadxyz', 'UNK'), dtype=tf.string, default_value=-1, num_oov_buckets=0), dimension=3, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x0000021C9AA14860>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True)

In reality I don't think I can use user_post_tag_cross anyway, because the tag vocabulary is ~40k (I'm surprised it works here at all, since I thought that was what caused the error I linked to above; maybe it has something to do with crossing two categorical columns that both have large vocabularies).
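To sanity-check what the cross actually produces on the toy data above, one way to inspect it (a sketch, assuming TF 1.x; the indicator_column wrapper is only there to materialise the cross as a dense tensor, it isn't part of my model) is:

import tensorflow as tf
import numpy as np

tag_vocab = ['brad-pitt','usa','donald-trump','tpb','tensorflow','xyzpadxyz','UNK']

user_tag_cat = tf.feature_column.categorical_column_with_vocabulary_list(
    key = 'user_tag', vocabulary_list = tag_vocab)
post_tag_cat = tf.feature_column.categorical_column_with_vocabulary_list(
    key = 'post_tag', vocabulary_list = tag_vocab)
user_post_tag_cross = tf.feature_column.crossed_column(
    keys = [user_tag_cat, post_tag_cat], hash_bucket_size = 5)

features = {
    'user_tag': np.random.choice(tag_vocab, 20).reshape((2, 10)),  # 2 rows, 10 tags each
    'post_tag': np.random.choice(tag_vocab, 10).reshape((2, 5)),   # 2 rows, 5 tags each
}

# indicator_column turns the hashed cross into a dense count vector per row
cross_dense = tf.feature_column.input_layer(
    features, [tf.feature_column.indicator_column(user_post_tag_cross)])

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(cross_dense))  # shape (2, 5): counts of crossed tag pairs per hash bucket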

I think what I ideally want to do is cross the embeddings somehow. If I put both features into embeddings of the same dimension, is there any way to cross them, via feature_columns or some other approach?
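For example, what I mean by "crossing the embeddings" is roughly the sketch below (reusing the columns and toy data from the example above, and assuming both embeddings share the same dimension; this is just to illustrate the idea, not working code I have):

# Both embeddings need the same dimension for an element-wise combination (3 here).
user_tag_embed3 = tf.feature_column.embedding_column(user_tag_cat, dimension = 3)
post_tag_embed3 = tf.feature_column.embedding_column(post_tag_cat, dimension = 3)

features = {'user_tag': user_tag_train, 'post_tag': post_tag_train}
user_vec = tf.feature_column.input_layer(features, [user_tag_embed3])  # [batch, 3]
post_vec = tf.feature_column.input_layer(features, [post_tag_embed3])  # [batch, 3]

# One possible "embedding cross": element-wise product of the two pooled embeddings,
# which could then be concatenated with the other deep features in a custom model_fn.
embed_cross = user_vec * post_vec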

0 Answers:

There are no answers yet.