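I am using the pretrained fastText models (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).
I load the fastText model with Gensim. It can output a vector for any word, whether it has been seen or not (out-of-vocabulary).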
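In TensorFlow, I know how to get trainable embeddings for seen words by looking them up by index. This is how I load the model and query it with Gensim: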
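# Note: gensim.models.wrappers was removed in Gensim 4;
# gensim.models.fasttext.load_facebook_vectors is the newer loader.
from gensim.models.wrappers import FastText

# Load the pretrained English model
en_model = FastText.load_fasttext_format('../wiki.en/wiki.en')
print(en_model['car'])        # in-vocabulary word
print(en_model['carcaryou'])  # out-of-vocabulary word, built from subwords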
The indices of known words are easy to get. For unseen words, however, fastText "predicts" their latent vectors from subword patterns, so unseen words have no index.
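As a rough illustration of that subword mechanism, here is a minimal pure-Python sketch. It is not fastText's actual implementation, which hashes n-grams into a fixed number of buckets. The names `char_ngrams`, `oov_vector` and `toy_table` are hypothetical, and the n-gram lengths 3 to 6 are an assumption.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the n-grams present in a (hypothetical) lookup table."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        # No n-gram is known, so no vector can be built for this word
        raise KeyError("no known n-grams for word %r" % word)
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy table with made-up 2-dimensional n-gram vectors (assumption, not real data)
toy_table = {"<ca": [1.0, 0.0], "car": [0.0, 1.0], "ar>": [1.0, 1.0]}
vec = oov_vector("car", toy_table, dim=2)
```

The vector depends only on the n-grams, not on a vocabulary index, which is why an index-based embedding lookup cannot cover unseen words directly.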
In this case, how should I use TensorFlow to handle both known and unseen words with fastText?
Answer 0 (score: 1)
I found a workaround using tf.py_func:
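def lookup(arr):
    # `model` and `decode` are module-level objects defined in the full script
    decoded_arr = decode(arr)
    # One 300-dimensional vector per word in the (batch, words) array
    new_arr = np.zeros((*arr.shape, 300))
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            try:
                new_arr[s, w] = model.wv[word]
            except Exception as e:
                # No vector could be built: report it and keep zeros
                print(e)
                new_arr[s, w] = np.zeros(300)
    return new_arr.astype(np.float32)

# tf.py_func is the TensorFlow 1.x API for wrapping a Python function as an op
z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)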
The following code works (it uses French, sorry, but that does not matter):
import tensorflow as tf
import numpy as np
from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format("../../Tracfin/dev/han/data/embeddings/cc.fr.300.bin")
# py_func passes strings as bytes, so decode every element to str
decode = np.vectorize(lambda x: x.decode("utf-8"))


def lookup(arr):
    """Map a (batch, words) array of byte strings to (batch, words, 300) vectors."""
    decoded_arr = decode(arr)
    new_arr = np.zeros((*arr.shape, 300))
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            try:
                new_arr[s, w] = model.wv[word]
            except Exception as e:
                # No vector could be built: report it and keep zeros
                print(e)
                new_arr[s, w] = np.zeros(300)
    return new_arr.astype(np.float32)


def extract_words(token):
    # Split the line into words on spaces
    out = tf.string_split([token], delimiter=" ")
    # Convert to a dense tensor, filling with the default value
    out = tf.reshape(tf.sparse_tensor_to_dense(out, default_value="<pad>"), [-1])
    return out


textfile = "text.txt"
words = [
    "ceci est un texte hexabromocyclododécanes intéressant qui mentionne des",
    "mots connus et des mots inconnus commeceluici ou celui-là polybromobiphényle",
]
# Write as UTF-8 so the accented words match the decoding in `decode`
with open(textfile, "w", encoding="utf-8") as f:
    f.write("\n".join(words))

tf.reset_default_graph()
padded_shapes = tf.TensorShape([None])
padding_values = "<pad>"

dataset = tf.data.TextLineDataset(textfile)
dataset = dataset.map(extract_words, num_parallel_calls=2)
dataset = dataset.shuffle(10000, reshuffle_each_iteration=True)
dataset = dataset.repeat()
dataset = dataset.padded_batch(3, padded_shapes, padding_values)

iterator = tf.data.Iterator.from_structure(
    dataset.output_types, dataset.output_shapes
)
dataset_init_op = iterator.make_initializer(dataset, name="dataset_init_op")

x = iterator.get_next()
z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)
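sess = tf.InteractiveSession()
sess.run(dataset_init_op)
y, w = sess.run([x, z])
y = decode(y)

print(
    "\nWords out of vocabulary: ",
    # Built-in sum: calling np.sum on a generator is deprecated
    sum(1 for word in y.reshape(-1) if word not in model.wv.vocab),
)
# Compare the vector of the first word in the batch with a direct model lookup
print("Lookup worked: ", np.allclose(model.wv[y[0, 0]], w[0, 0]))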
It prints:
Words out of vocabulary: 6
Lookup worked: True
I have not tried to optimize anything, especially the lookup loop; comments are welcome.
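One possible direction for the lookup loop, as a sketch only: query the model once per distinct word and reuse the result, since padded batches repeat tokens such as "<pad>" many times. The helper `lookup_unique` and the `get_vector` callable are hypothetical names, and the toy table stands in for the fastText model so that the example runs without it. I have not measured whether this is faster on real data.

```python
def lookup_unique(batch, get_vector, dim):
    """Embed a nested sequence of tokens, querying each distinct word only once.

    `get_vector` is a hypothetical callable (for example `lambda w: model.wv[w]`)
    that returns `dim` floats for a word or raises KeyError.
    """
    cache = {}
    rows = []
    for sent in batch:
        row = []
        for token in sent:
            # py_func delivers strings as bytes; plain str is accepted as well
            word = token.decode("utf-8") if isinstance(token, bytes) else token
            if word not in cache:
                try:
                    cache[word] = [float(v) for v in get_vector(word)]
                except KeyError:
                    # Same fallback as the original loop: a zero vector
                    cache[word] = [0.0] * dim
            row.append(cache[word])
        rows.append(row)
    return rows


# Toy stand-in for the model that records every query (made-up vectors)
calls = []
toy_vectors = {"mots": [1.0, 2.0], "<pad>": [0.5, 0.5]}


def toy_get_vector(word):
    calls.append(word)
    return toy_vectors[word]


batch = [[b"mots", b"mots", b"<pad>"], [b"inconnu", b"<pad>", b"<pad>"]]
result = lookup_unique(batch, toy_get_vector, dim=2)
```

Inside the py_func callback, the nested list would still need to be converted with np.asarray(..., dtype=np.float32) before being returned.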