Consider the following batched_dataset:
import tensorflow as tf

samples = [
    {"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
    {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
    {"query": "this is a query 3", "doc": "this is one relevant document regarding query 3"},
    {"query": "this is a query 4", "doc": "this is one relevant document regarding query 4"},
]
dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})
batched_dataset = dataset.batch(2)
#{
#'doc': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
# [b'this is one relevant document regarding query 1',
# b'this is one relevant document regarding query 2'], dtype=object)>,
#
#'query': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
# [b'this is a query 1',
# b'this is a query 2'], dtype=object)>
#}
and a map function that tokenizes this batched_dataset:
def tokenize(sample):
    tokenized_query = tokenizer.batch_encode_plus(sample["query"].numpy().astype('str'), ...)
    tokenized_doc = tokenizer.batch_encode_plus(sample["doc"].numpy().astype('str'), ...)
    return (tokenized_query, tokenized_doc)
I can tokenize the whole batched_dataset with a for loop:
for batch in batched_dataset:
    tokenize(batch)
# (
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[ 101, 2023, 2003, 1037, 23032, 1015, 102, 0],
# [ 101, 2023, 2003, 1037, 23032, 1016, 102, 0]],
# dtype=int32)>,
# 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[1, 1, 1, 1, 1, 1, 1, 0],
# [1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>},
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[ 101, 2023, 2003, 2028, 7882, 6254, 4953, 102],
# [ 101, 2023, 2003, 2028, 7882, 6254, 4953, 102]], dtype=int32)>,
# 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>})
# ...
However, when I use tf.data.Dataset.map instead, I get the following error:
tokenized_dataset = batched_dataset.map(tokenize)
AttributeError: 'Tensor' object has no attribute 'numpy'
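As far as I understand, the error occurs because Dataset.map traces the function in graph mode, where the tensors are symbolic and have no .numpy() method. One direction that might work is wrapping the Python-side step in tf.py_function, which runs it eagerly. The following is only a minimal sketch under assumptions: fake_tokenize is a hypothetical stand-in for the real tokenizer (it just counts whitespace-separated words), and from_tensor_slices is used instead of from_generator to keep the example short.

```python
import tensorflow as tf

samples = [
    {"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
    {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
]
dataset = tf.data.Dataset.from_tensor_slices({
    "query": [s["query"] for s in samples],
    "doc": [s["doc"] for s in samples],
})
batched_dataset = dataset.batch(2)


def fake_tokenize(texts):
    # Hypothetical stand-in for the tokenizer: returns the word count of each string.
    # Inside tf.py_function the argument is an eager tensor, so .numpy() is available.
    counts = [len(t.decode("utf-8").split()) for t in texts.numpy()]
    return tf.constant(counts, dtype=tf.int32)


def tokenize(sample):
    # tf.py_function executes fake_tokenize eagerly while map() builds its graph.
    query_len = tf.py_function(fake_tokenize, inp=[sample["query"]], Tout=tf.int32)
    doc_len = tf.py_function(fake_tokenize, inp=[sample["doc"]], Tout=tf.int32)
    # py_function outputs have unknown static shape, so restore it by hand.
    query_len.set_shape([None])
    doc_len.set_shape([None])
    return query_len, doc_len


tokenized_dataset = batched_dataset.map(tokenize)
result = [(q.numpy().tolist(), d.numpy().tolist()) for q, d in tokenized_dataset]
```

With a real tokenizer, each output (for example input_ids and attention_mask) would presumably need its own entry in Tout, since tf.py_function returns tensors rather than the dictionary that batch_encode_plus produces.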
So how do I properly apply a tokenizer map function to a batched dataset?
Note: I have published a working example on Google Colab.