How do I correctly apply a tokenizer map function to a TensorFlow batched dataset?

Asked: 2020-04-18 18:58:58

Tags: tensorflow tensorflow-datasets huggingface-transformers

Consider the following batched_dataset:

import tensorflow as tf

samples = [
    {"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
    {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
    {"query": "this is a query 3", "doc": "this is one relevant document regarding query 3"},
    {"query": "this is a query 4", "doc": "this is one relevant document regarding query 4"},
]

# Each dataset element is a dict holding two scalar strings.
dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})

# Group the elements into batches of two.
batched_dataset = dataset.batch(2)

#{
#'doc': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is one relevant document regarding query 1',
#      b'this is one relevant document regarding query 2'], dtype=object)>,
# 
#'query': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is a query 1', 
#      b'this is a query 2'], dtype=object)>
#}

and a map function that tokenizes this batched_dataset:

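# `tokenizer` is a Hugging Face transformers tokenizer; the remaining
# batch_encode_plus arguments are elided here.
def tokenize(sample):
    # .numpy() turns the string tensors into arrays of Python strings.
    tokenized_query = tokenizer.batch_encode_plus(sample["query"].numpy().astype('str'), ...)
    tokenized_doc = tokenizer.batch_encode_plus(sample["doc"].numpy().astype('str'), ...)
    return (tokenized_query, tokenized_doc)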

I can tokenize the whole batched_dataset with a for loop:

for batch in batched_dataset:
    tokenize(batch)
# (
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[  101,  2023,  2003,  1037, 23032,  1015,   102,     0],
#          [  101,  2023,  2003,  1037, 23032,  1016,   102,     0]],
#      dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 0],
#          [1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>}, 

# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102],
#          [ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102]], dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 1],
#          [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>})
#  ...

However, using tf.data.Dataset.map raises the following error:

tokenized_dataset = batched_dataset.map(tokenize)
AttributeError: 'Tensor' object has no attribute 'numpy'

How, then, do I correctly apply the tokenizer map function to the batched dataset?

Note: I have posted a working example on Google Colab.
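
For context on the error: a function passed to tf.data.Dataset.map is traced in graph mode, where tensors are symbolic and have no .numpy() method, whereas the for loop runs eagerly. A minimal sketch of one possible workaround is to wrap the eager code in tf.py_function. To keep the sketch self-contained, toy_encode is a hypothetical stand-in for tokenizer.batch_encode_plus (it only maps each word to its length), and the fixed length of 8 and the output keys query_ids and doc_ids are assumptions, not part of the original code.

```python
import numpy as np
import tensorflow as tf

MAX_LEN = 8  # assumed fixed padding length, for illustration only


def toy_encode(texts, max_len=MAX_LEN):
    """Hypothetical stand-in for tokenizer.batch_encode_plus.

    Maps each whitespace-separated word to its length and zero-pads.
    """
    ids = np.zeros((len(texts), max_len), dtype=np.int32)
    for i, text in enumerate(texts):
        tokens = [len(word) for word in text.split()][:max_len]
        ids[i, :len(tokens)] = tokens
    return ids


def tokenize_eager(query, doc):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    queries = [q.decode("utf-8") for q in query.numpy()]
    docs = [d.decode("utf-8") for d in doc.numpy()]
    return toy_encode(queries), toy_encode(docs)


def tokenize(sample):
    # tf.py_function bridges the traced graph and the eager Python code.
    query_ids, doc_ids = tf.py_function(
        tokenize_eager,
        inp=[sample["query"], sample["doc"]],
        Tout=[tf.int32, tf.int32],
    )
    # py_function outputs lose their static shape; restore what is known.
    query_ids.set_shape([None, MAX_LEN])
    doc_ids.set_shape([None, MAX_LEN])
    return {"query_ids": query_ids, "doc_ids": doc_ids}


dataset = tf.data.Dataset.from_tensor_slices({
    "query": ["this is a query 1", "this is a query 2"],
    "doc": ["this is one relevant document regarding query 1",
            "this is one relevant document regarding query 2"],
})
tokenized_dataset = dataset.batch(2).map(tokenize)
first = next(iter(tokenized_dataset))
print(first["query_ids"].shape)
```

With a real tokenizer, tokenize_eager would presumably return the input_ids and attention_mask arrays instead, with one Tout entry per returned tensor. Note that code wrapped in tf.py_function still runs in the Python interpreter, so it does not gain graph-mode speedups.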

0 answers:

No answers yet.