Question

I want to generate n-grams from a sequence of tokens:

bigram: "1 3 4 5" --> { (1,3), (3,4), (4,5) }

After searching, I found this thread, which uses:

def find_ngrams(input_list, n):
  return zip(*[input_list[i:] for i in range(n)])

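If I use this code during training, I think the for loop will hurt performance inside a deep learning library. So I am looking for a better option, such as a lambda function or something similar. (All the sequences could also be generated in a preprocessing step, but I don't think that is an elegant approach...)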
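As a quick check of the snippet above, here is a minimal sketch that runs it on the token sequence from the example. The variable names `tokens` and `bigrams` are only illustrative. Note that the comprehension iterates over `range(n)`, so it builds n shifted slices rather than looping once per token.

```python
def find_ngrams(input_list, n):
    # Build n shifted views of the list and zip them into n-tuples.
    return zip(*[input_list[i:] for i in range(n)])

tokens = [1, 3, 4, 5]
bigrams = list(find_ngrams(tokens, 2))
trigrams = list(find_ngrams(tokens, 3))
print(bigrams)   # [(1, 3), (3, 4), (4, 5)]
print(trigrams)  # [(1, 3, 4), (3, 4, 5)]
```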

Answer 1

In my case, I found that pooling was a solution:

AveragePooling1D(pool_size=2, strides=1, padding='same')
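As I understand it, the idea is that average pooling with pool_size=2 and strides=1 over a sequence of token embeddings averages each pair of neighbouring vectors, giving one feature vector per bigram. Below is a minimal pure-Python sketch of that computation, not the Keras implementation. The function `average_adjacent` and the toy embeddings are hypothetical, and the sketch uses no padding, whereas padding='same' keeps the output the same length as the input.

```python
def average_adjacent(embeddings, pool_size=2):
    """Average each window of `pool_size` consecutive vectors (stride 1, no padding)."""
    # Shifted slices give the sliding windows, as in find_ngrams.
    windows = zip(*[embeddings[i:] for i in range(pool_size)])
    return [
        # zip(*window) groups the values of each embedding dimension together.
        [sum(values) / pool_size for values in zip(*window)]
        for window in windows
    ]

# Toy 2-dimensional embeddings for three tokens (made-up values).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
bigram_features = average_adjacent(token_embeddings)
print(bigram_features)  # [[2.0, 1.0], [4.0, 3.0]]
```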

If you need to generate bigrams as strings:

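import tensorflow as tf

tf.enable_eager_execution()  # TF 1.x only; eager execution is the default in TF 2.x

sentence = ['this is example sentence']
tokens = tf.string_split(sentence).values  # flat tensor of whitespace-separated tokens
# Join each token with its successor to form the bigrams.
x = tokens[:-1] + ' ' + tokens[1:]

# tf.Tensor([b'this is' b'is example' b'example sentence'], shape=(3,), dtype=string)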
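The snippet above relies on TF 1.x calls (`tf.enable_eager_execution`, `tf.string_split`). On TensorFlow 2.x, a roughly equivalent sketch could use `tf.strings.split` together with `tf.strings.ngrams`. This assumes a TF 2.x installation and is not part of the original answer.

```python
import tensorflow as tf  # assumes TensorFlow 2.x, where eager execution is on by default

sentence = ['this is example sentence']
# RaggedTensor with one row of tokens per input string.
tokens = tf.strings.split(sentence)
# ngram_width=2 gives bigrams; ngram_width=3 would give trigrams.
bigrams = tf.strings.ngrams(tokens, ngram_width=2, separator=' ')
print(bigrams.to_list())
```

The result is ragged, with one row of n-grams per input sentence, so a batch of sentences of different lengths can be processed in one call.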

You can also use tensorflow-transform to generate n-grams.

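import tensorflow_transform as tft

# ngram_range is (min_n, max_n): (1, 1) yields unigrams only, (2, 2) would yield bigrams.
# The input is expected to be a SparseTensor of string tokens.
tft.ngrams(tensor, (1, 1), separator=' ')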

Note: as of January 22, 2019, tensorflow-transform only supports Python 2.

Generating n-grams (bigrams or trigrams) in Keras / TensorFlow
