I trained two fastText models, each initialized with pretrained vectors.
import fasttext

# Train each model on its word list, initializing from the pretrained cc.300 vectors.
tcn_model = fasttext.train_supervised('tcn_word_list.txt', dim=300, pretrainedVectors='muse/data/cc.zh.300.vec')
tcn_model.save_model('tcn_model.bin')
en_model = fasttext.train_supervised('en_word_list.txt', dim=300, pretrainedVectors='muse/data/cc.en.300.vec')
en_model.save_model('en_model.bin')
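(For completeness: MUSE also reads plain-text .vec embeddings, so one workaround I'm considering is dumping each binary model to that format. export_vec below is just my own sketch built on the fastText Python API, not part of either library.)

import fasttext

def export_vec(bin_path, vec_path):
    # Dump a binary fastText model to the plain-text .vec format.
    model = fasttext.load_model(bin_path)
    words = model.get_words()
    with open(vec_path, 'w', encoding='utf-8') as f:
        f.write(f"{len(words)} {model.get_dimension()}\n")  # header: vocab size and dimension
        for w in words:
            values = " ".join(f"{x:.4f}" for x in model.get_word_vector(w))
            f.write(f"{w} {values}\n")

export_vec('tcn_model.bin', 'tcn_model.vec')
export_vec('en_model.bin', 'en_model.vec')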
Afterwards, when I tried to align the two models with MUSE, I got the following ValueError.
# Unsupervised MUSE training
!python muse/unsupervised.py --src_lang zh --tgt_lang en \
--src_emb tcn_model.bin --tgt_emb en_model.bin \
--n_refinement 5 --normalize_embeddings center \
--dis_most_frequent 0 --emb_dim 300 \
--dico_eval zhen_dict.txt
INFO - 08/01/20 08:14:42 - 0:00:04 - Loaded binary model. Generating embeddings ...
Traceback (most recent call last):
  File "muse/unsupervised.py", line 95, in <module>
    src_emb, tgt_emb, mapping, discriminator = build_model(params, True)
  File "/workspace/torch/muse/src/models.py", line 46, in build_model
    src_dico, _src_emb = load_embeddings(params, source=True)
  File "/workspace/torch/muse/src/utils.py", line 404, in load_embeddings
    return load_bin_embeddings(params, source, full_vocab)
  File "/workspace/torch/muse/src/utils.py", line 370, in load_bin_embeddings
    embeddings = torch.from_numpy(np.concatenate([model.get_word_vector(w)[None] for w in words], 0))
ValueError: need at least one array to concatenate
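The exception itself is just what NumPy raises when np.concatenate gets an empty sequence, so the list comprehension over words apparently yields nothing:

import numpy as np
np.concatenate([], 0)  # ValueError: need at least one array to concatenate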
After looking at where the error occurs, I ran the following test, which executes without any error.
import numpy as np
import torch

test = fasttext.load_model('tcn_model.bin')
# Reproduce the failing line from muse/src/utils.py on the first two words.
a = [test.get_word_vector(w) for w in test.get_words()[:2]]
b = [test.get_word_vector(w)[None] for w in test.get_words()[:2]]
emba = torch.from_numpy(np.concatenate(a, 0))  # shape (600,): the (300,) vectors are joined end to end
embb = torch.from_numpy(np.concatenate(b, 0))  # shape (2, 300): the (1, 300) rows are stacked
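One thing I haven't been able to rule out (this is my reading of the MUSE source, so treat it as an assumption): load_bin_embeddings in muse/src/utils.py seems to build words from model.get_labels() rather than model.get_words(), and for a model trained with train_supervised the labels are the __label__ entries, not the vocabulary. That would leave the list empty here. A quick check:

test = fasttext.load_model('tcn_model.bin')
print(len(test.get_words()))   # size of the input vocabulary
print(len(test.get_labels()))  # __label__ entries only; if this is 0, np.concatenate gets an empty list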