I have two features, post_tags and user_tags. These are padded strings of N and M words respectively.
For example, a post_tag might be "the-oscars brad-pitt xyzpadxyz xyzpadxyz xyzpadxyz", so this post is tagged as being about the Oscars and Brad Pitt.
Then, for user_tag, we might have "brad-pitt universal sag the-academy-awards xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz". The user_tag here is an example of the tags from the last 20 posts the user consumed.
So post_tag is always 5 tags long, and user_tag is always 10 tags long.
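As a minimal sketch of how such fixed-length strings could be produced (the helper name pad_tags and the choice to truncate over-long lists are assumptions for illustration, not part of my actual pipeline):

```python
PADWORD = "xyzpadxyz"  # pad token used in the examples above


def pad_tags(tags, length, padword=PADWORD):
    """Truncate or right-pad a list of tags to exactly `length` entries.

    Hypothetical helper: returns the tags joined by single spaces.
    """
    tags = list(tags)[:length]
    return " ".join(tags + [padword] * (length - len(tags)))


post_tag = pad_tags(["the-oscars", "brad-pitt"], 5)
user_tag = pad_tags(["brad-pitt", "universal", "sag", "the-academy-awards"], 10)

print(post_tag)  # the-oscars brad-pitt xyzpadxyz xyzpadxyz xyzpadxyz
print(len(user_tag.split()))  # 10
```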
I split each string into a tensor as part of the Dataset API processing, like so:
features['post_tag'] = tf.string_split([features['post_tag']])
features['post_tag'] = tf.sparse_tensor_to_dense(features['post_tag'], default_value=PADWORD)
features['user_tag'] = tf.string_split([features['user_tag']])
features['user_tag'] = tf.sparse_tensor_to_dense(features['user_tag'], default_value=PADWORD)
I have a vocabulary file for the tags, and I feed each feature into the deep features of a canned wide-and-deep estimator.
user_tag = tf.feature_column.categorical_column_with_vocabulary_file(
key='user_tag',
vocabulary_file='{}/tagvocab.csv'.format(INPUT_DIR)
)
user_tag_embed = tf.feature_column.embedding_column(
categorical_column = user_tag ,
dimension = USER_TAG_EMBEDDING_SIZE
)
deep.append(user_tag_embed)
post_tag = tf.feature_column.categorical_column_with_vocabulary_file(
key='post_tag',
vocabulary_file='{}/tagvocab.csv'.format(INPUT_DIR)
)
post_tag_embed = tf.feature_column.embedding_column(
categorical_column = post_tag ,
dimension = POST_TAG_EMBEDDING_SIZE
)
deep.append(post_tag_embed)
However, what I would really like to do is cross post_tag and user_tag. But if I do something like this:
user_post_tag_cross = tf.feature_column.crossed_column(
keys = [user_tag, post_tag],
hash_bucket_size = 25
)
wide.append(user_post_tag_cross)
I get this error:
InvalidArgumentError (see above for traceback): Expected D2 of index to be 2 got 3 at position 0
I can see that this comes from this line in the TF code.
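One possible explanation for the "2 got 3" in that message, which I have not verified against my actual pipeline: if the split is applied per example to a one-element list and the result is batched afterwards, each feature ends up with three dimensions rather than two. The sketch below mimics only the shapes in plain Python (no TensorFlow involved; both helper names are made up for illustration):

```python
def split_like_wrapped_scalar(tag_string):
    # Mimics the shape of splitting the one-element list [tag_string]:
    # the result has shape [1, n_tags], not [n_tags].
    return [tag_string.split()]


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


examples = [
    "the-oscars brad-pitt xyzpadxyz xyzpadxyz xyzpadxyz",
    "usa tensorflow xyzpadxyz xyzpadxyz xyzpadxyz",
]

# Assumption: the split runs per example and batching happens afterwards.
batch = [split_like_wrapped_scalar(s) for s in examples]
print(shape(batch))  # (2, 1, 5)

# Dropping the extra axis gives the [batch, n_tags] layout.
# (In TensorFlow the analogous per-example step would be something like
# tf.squeeze(..., axis=0); untested here.)
squeezed = [row[0] for row in batch]
print(shape(squeezed))  # (2, 5)
```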
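I have a feeling that crossing tensors like this may not be possible, or that it only expects two strings. I have related features, post_tag_first and user_tag_first, which are each just a single random tag passed in as a plain string. If I do this: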
user_post_first_tag_cross = tf.feature_column.crossed_column(
keys = ['user_tag_first', 'post_tag_first'],
hash_bucket_size = 25
)
wide.append(user_post_first_tag_cross)
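it works and trains fine. So I am really just trying to understand what the best way is to cross two tensors of tags.
Does anyone have any ideas? I assume this is something someone has already dealt with, since this kind of tag data is very common; the same applies to search terms and document terms, keywords, and so on.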
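To make concrete what I mean by crossing two multi-valued tag features, here is a plain-Python sketch of the idea: take every (user tag, post tag) pair and hash it into a fixed number of buckets. This is only an illustration. MD5 stands in for whatever hash TensorFlow uses internally, and skipping the pad token is my own assumption rather than documented crossed_column behaviour.

```python
import hashlib
from itertools import product

PADWORD = "xyzpadxyz"


def cross_tags(user_tags, post_tags, hash_bucket_size, padword=PADWORD):
    """Return the sorted bucket ids of all (user_tag, post_tag) pairs.

    Illustrative only: pairs containing the pad token are skipped, and
    MD5 is a stand-in hash, not the one TensorFlow uses.
    """
    buckets = set()
    for user_tag, post_tag in product(user_tags, post_tags):
        if padword in (user_tag, post_tag):
            continue
        digest = hashlib.md5(f"{user_tag}_X_{post_tag}".encode("utf-8")).hexdigest()
        buckets.add(int(digest, 16) % hash_bucket_size)
    return sorted(buckets)


user_tags = "brad-pitt universal sag the-academy-awards xyzpadxyz xyzpadxyz".split()
post_tags = "the-oscars brad-pitt xyzpadxyz xyzpadxyz xyzpadxyz".split()

crossed = cross_tags(user_tags, post_tags, hash_bucket_size=25)
print(crossed)  # up to 4 * 2 = 8 distinct bucket ids in [0, 25)
```

With only 25 buckets, different pairs will often collide, so the number of distinct ids can be smaller than the number of pairs.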
Here is a small runnable example that sets up what I am trying to do.
import tensorflow as tf
import numpy as np
# set up tag vocab
tag_vocab = ['brad-pitt','usa','donald-trump','tpb','tensorflow','xyzpadxyz','UNK']
# make data
# train data
user_tag_train = np.random.choice(tag_vocab,50).reshape((5, 10)) # 5 example rows with 10 tags per user
post_tag_train = np.random.choice(tag_vocab,25).reshape((5, 5)) # 5 example rows with 5 tags per post
x_train = np.array([1., 2., 3., 4., 5.]) # just another example numeric feature
y_train = np.array([0., -1., -2., -3., -4]) # outcome
# eval data
user_tag_eval = np.random.choice(tag_vocab,50).reshape((5, 10)) # 5 example rows with 10 tags per user
post_tag_eval = np.random.choice(tag_vocab,25).reshape((5, 5)) # 5 example rows with 5 tags per post
x_eval = np.array([2., 5., 8., 1., 5.])
y_eval = np.array([-1.01, -4.1, -7, 0., 9.])
# define feature cols
x_num = tf.feature_column.numeric_column("x", shape=[1])
user_tag_cat = tf.feature_column.categorical_column_with_vocabulary_list(
key = 'user_tag',
vocabulary_list = tag_vocab)
user_tag_embed = tf.feature_column.embedding_column(
categorical_column = user_tag_cat,
dimension = 3
)
post_tag_cat = tf.feature_column.categorical_column_with_vocabulary_list(
key = 'post_tag',
vocabulary_list = tag_vocab)
post_tag_embed = tf.feature_column.embedding_column(
categorical_column = post_tag_cat,
dimension = 2
)
user_post_tag_cross = tf.feature_column.crossed_column(
keys = [ user_tag_cat, post_tag_cat ],
hash_bucket_size = 5
)
#user_post_tag_embed_cross = tf.feature_column.crossed_column(
# keys = [ user_tag_embed, post_tag_embed ],
# hash_bucket_size = 5
#)
feature_columns = [
x_num,
user_tag_cat,
post_tag_cat,
user_tag_embed,
post_tag_embed,
user_post_tag_cross,
#user_post_tag_embed_cross,
]
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)
input_fn = tf.estimator.inputs.numpy_input_fn(
{
"x": x_train,
"user_tag": user_tag_train,
"post_tag": post_tag_train,
},
y_train,
batch_size=5,
num_epochs=None,
shuffle=True
)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
{
"x": x_train,
"user_tag": user_tag_train,
"post_tag": post_tag_train,
},
y_train,
batch_size=5,
num_epochs=1000,
shuffle=False
)
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
{
"x": x_eval,
"user_tag": user_tag_eval,
"post_tag": post_tag_eval
},
y_eval,
batch_size=5,
num_epochs=1000,
shuffle=False
)
estimator.train(input_fn=input_fn, steps=1000)
train_metrics = estimator.evaluate(input_fn=train_input_fn)
eval_metrics = estimator.evaluate(input_fn=eval_input_fn)
print("\n\ntrain metrics: %r"% train_metrics)
print("eval metrics: %r"% eval_metrics)
The code above runs, but if you uncomment the parts related to user_post_tag_embed_cross, it fails (as expected) with:
ValueError: Unsupported key type. All keys must be either string, or categorical column except _HashedCategoricalColumn. Given: _EmbeddingColumn(categorical_column=_VocabularyListCategoricalColumn(key='user_tag', vocabulary_list=('brad-pitt', 'usa', 'donald-trump', 'tpb', 'tensorflow', 'xyzpadxyz', 'UNK'), dtype=tf.string, default_value=-1, num_oov_buckets=0), dimension=3, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x0000021C9AA14860>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True)
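In practice I don't think I can use user_post_tag_cross, because the tag vocabulary is ~40k. (I am surprised it works here, since I thought this was the cause of the error I mentioned above; perhaps that error has to do with trying to cross two categorical columns with large vocabularies.)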
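A quick back-of-the-envelope calculation shows why the vocabulary size worries me. The bucket sizes other than 25 are hypothetical values chosen for comparison, and in practice far fewer than all possible pairs will actually occur in the data:

```python
vocab_size = 40_000  # approximate tag vocabulary size

# Upper bound on distinct (user_tag, post_tag) pairs.
possible_pairs = vocab_size ** 2
print(f"possible pairs: {possible_pairs:,}")  # possible pairs: 1,600,000,000

# Average number of possible pairs sharing one hash bucket.
for hash_bucket_size in (25, 10_000, 1_000_000):
    pairs_per_bucket = possible_pairs // hash_bucket_size
    print(f"hash_bucket_size={hash_bucket_size:>9,}: "
          f"~{pairs_per_bucket:,} possible pairs per bucket")
```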
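I think what I ideally want is to somehow cross the embeddings. If I put both features into embeddings of the same dimension, is there a way to cross them using feature_columns() or some other approach?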
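To illustrate what I have in mind by crossing embeddings, here is a NumPy sketch of one possible interpretation: mean-pool each tag list into a vector (matching the combiner='mean' shown in the error above), then combine the two vectors element-wise or with a dot product. The shared embedding table, its random values, and EMBEDDING_SIZE are assumptions for illustration; this is not a feature_column API.

```python
import numpy as np

rng = np.random.default_rng(0)

tag_vocab = ['brad-pitt', 'usa', 'donald-trump', 'tpb', 'tensorflow', 'xyzpadxyz', 'UNK']
tag_to_id = {tag: i for i, tag in enumerate(tag_vocab)}
EMBEDDING_SIZE = 3  # hypothetical shared dimension for both features

# Random stand-in for an embedding table that a real model would learn.
embedding_table = rng.normal(size=(len(tag_vocab), EMBEDDING_SIZE))


def mean_embedding(tags):
    """Look up each tag (unknown tags map to 'UNK') and average the vectors."""
    ids = [tag_to_id.get(tag, tag_to_id['UNK']) for tag in tags]
    return embedding_table[ids].mean(axis=0)


user_vec = mean_embedding(['brad-pitt', 'usa', 'tpb'] + ['xyzpadxyz'] * 7)
post_vec = mean_embedding(['brad-pitt', 'tensorflow'] + ['xyzpadxyz'] * 3)

# Two simple ways to let the embeddings interact:
elementwise = user_vec * post_vec         # vector of length EMBEDDING_SIZE
similarity = float(user_vec @ post_vec)   # single scalar

print(elementwise.shape)  # (3,)
```

Note that the pad token contributes to both averages in this sketch; masking it out before pooling may be preferable.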