Question

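I am using the cnn_dailymail dataset from TensorFlow Datasets. My goal is to tokenize the dataset after applying some text preprocessing steps to it.

I access and preprocess the dataset as follows: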

!pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow as tf
import tensorflow_datasets as tfds

data, info = tfds.load('cnn_dailymail', with_info=True)
train_data, test_data = data['train'], data['test']

def map_fn(x, start=tf.constant('<start>'), end=tf.constant('<end>')):
   strings = [start, x['highlights'], end]
   x['highlights'] = tf.strings.join(strings, separator=' ')
   return x

train_data_preproc = train_data.map(map_fn)
elem, = train_data_preproc.take(1)
elem['highlights'].numpy()
# b'<start> mother announced as imedeen ambassador . ...

To tokenize the dataset, I came across the tfds.features.text.Tokenizer class. However, it does not behave the way I want:

tokenizer = tfds.features.text.Tokenizer(alphanum_only=False, reserved_tokens=['<start>', '<end>'])
tokenizer.tokenize(elem['highlights'].numpy())
# ['<start>', ' ', 'mother', ' ', 'announced', ' ', 'as', ' ', 'imedeen', ' ', 'ambassador', ' . ',...]

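I would like the tokenizer to split on whitespace only, rather than treating whitespace as separate tokens. Is there a way to do this? Would it be better to write my own tokenizer function and apply it with dataset.map()? Thanks!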

Answer 1

For anyone struggling with this part, the TensorFlow documentation covers it explicitly: https://www.tensorflow.org/tutorials/load_data/text#encode_text_lines_as_numbers

import tensorflow as tf
import tensorflow_datasets as tfds  # provides tfds.features.text.Tokenizer

X = ... # list of string
y = ... # list of corresponding labels

train_data = tf.data.Dataset.from_tensor_slices((X, y))

# Building vocabulary set for tokenizer
tokenizer = tfds.features.text.Tokenizer()

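vocabulary_set = set()
for text_tensor, _ in train_data:
  some_tokens = tokenizer.tokenize(text_tensor.numpy())
  vocabulary_set.update(some_tokens)

# Map every token in the vocabulary to an integer ID (as in the linked tutorial)
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)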

# Encoding functions
def encode(text_tensor, label):
  encoded_text = encoder.encode(text_tensor.numpy())
  return encoded_text, label

def encode_map_fn(text, label):
  # py_func doesn't set the shape of the returned tensors.
  encoded_text, label = tf.py_function(encode, 
                                       inp=[text, label], 
                                       Tout=(tf.int64, tf.int64))

  # `tf.data.Datasets` work best if all components have a shape set
  #  so set the shapes manually: 
  encoded_text.set_shape([None])
  label.set_shape([])

  return encoded_text, label


train_data_tokenized = train_data.map(encode_map_fn)

Here train_data is a tf.data.Dataset object made up of sentences and labels.
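If splitting on whitespace alone is enough, as the question asks, the tf.py_function round trip may not be needed at all. The sketch below assumes a TensorFlow 2.x release in which tf.strings.split accepts a scalar string and returns a 1-D string tensor. The function name split_on_whitespace and the toy dataset are made up for illustration; only the 'highlights' key comes from the question.

```python
import tensorflow as tf

def split_on_whitespace(x):
    # With sep=None (the default), tf.strings.split treats runs of
    # whitespace as a single separator and produces no empty tokens.
    tokens = tf.strings.split(x['highlights'])
    # Copy the element rather than mutating the input dict.
    out = dict(x)
    out['tokens'] = tokens
    return out

# Toy stand-in for train_data_preproc: one element with a 'highlights' string.
toy_data = tf.data.Dataset.from_tensor_slices(
    {'highlights': ['<start> mother announced as imedeen ambassador . <end>']})

elem = next(iter(toy_data.map(split_on_whitespace)))
tokens = elem['tokens'].numpy().tolist()
print(tokens)
```

Because the split runs as a graph op inside dataset.map(), it avoids the Python callback used above. The tokens are still byte strings, so turning them into integer IDs would need a separate lookup step.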

Answer 2

For readers who land on this page...

Below is a gist that may help with tokenization in TensorFlow.

Link: https://gist.github.com/Mageswaran1989/70fd26af52ca4afb86e611f84ac83e97#file-text_preprocessing-ipynb

There are several options available:

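TensorFlow Datasets API: Tokenizer + Encoder
TensorFlow Keras text preprocessing: an all-in-one Tokenizer
- API: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer?version=stable
- Tutorial: https://www.tensorflow.org/tutorials/text/nmt_with_attention
In my trials, this one was simple and easy to use for tokenizing and for encoding/decoding at both the word and character level
TensorFlow Text: for more advanced usage, directly with the TF Dataset API and Keras layers.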
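The "tokenizer + encoder" pattern from the first option comes down to two steps: split text into tokens, then map each token to an integer ID. The plain-Python sketch below shows that idea with whitespace-only splitting. It is not the tfds or Keras implementation: whitespace_tokenize and WhitespaceEncoder are hypothetical names, and reserving ID 0 for padding is an assumption.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace; whitespace itself never becomes a token."""
    if isinstance(text, bytes):  # tensor.numpy() yields bytes for tf.string
        text = text.decode('utf-8')
    return text.split()

class WhitespaceEncoder:
    """Hypothetical stand-in for a tokenizer + encoder pair."""

    def __init__(self, vocabulary):
        # Assumption: ID 0 is kept free for padding, so tokens start at 1.
        self._token_to_id = {tok: i + 1 for i, tok in enumerate(sorted(vocabulary))}
        self._id_to_token = {i: tok for tok, i in self._token_to_id.items()}
        # Tokens outside the vocabulary all share one out-of-vocabulary ID.
        self._oov_id = len(self._token_to_id) + 1

    def encode(self, text):
        return [self._token_to_id.get(tok, self._oov_id)
                for tok in whitespace_tokenize(text)]

    def decode(self, ids):
        return ' '.join(self._id_to_token.get(i, '<unk>') for i in ids)

corpus = [b'<start> mother announced as imedeen ambassador . <end>']

# Build the vocabulary from the corpus, then encode and decode one line.
vocabulary = set()
for line in corpus:
    vocabulary.update(whitespace_tokenize(line))

encoder = WhitespaceEncoder(vocabulary)
ids = encoder.encode(corpus[0])
round_trip = encoder.decode(ids)
print(ids, round_trip)
```

An object like this exposes the same encode(bytes) call that an encode function wrapped in tf.py_function would use, so it could in principle be applied through dataset.map() in the same way.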

How do I apply tokenization to a TensorFlow dataset?
