I'm new to word2vec. I need to fine-tune my word2vec model.
I have two datasets, data1 and data2. What I have done so far is:
model = gensim.models.Word2Vec(
data1,
size=size_v,
window=size_w,
min_count=min_c,
workers=work)
model.train(data1, total_examples=len(data1), epochs=epochs)
model.train(data2, total_examples=len(data2), epochs=epochs)
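Is this correct? Do I need to store the learned weights somewhere?
I checked this answer and this one, but I couldn't understand how it is done.
Can someone explain the steps to follow?
Thanks in advance.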
Answer 0: (score: 2)
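Is this correct?
Yes. However, you need to make sure that the words in data2 are in the vocabulary built from data1. Any word that is not in that vocabulary will be lost.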
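To see which words would be lost, you can compare data2 against the vocabulary that data1 produces. The following is a minimal pure-Python sketch, not gensim's own logic: `vocab_with_min_count` and `out_of_vocab` are hypothetical helper names, the toy corpora are made up, and the `min_count` filter only approximates what Word2Vec does when it builds its vocabulary.

```python
from collections import Counter


def vocab_with_min_count(sentences, min_count=1):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def out_of_vocab(sentences, vocab):
    """Return the sorted words in sentences that are missing from vocab."""
    return sorted({w for sentence in sentences for w in sentence if w not in vocab})


# Toy tokenized corpora (illustrative only).
data1 = [["the", "cat", "sat"], ["the", "dog", "sat"]]
data2 = [["the", "fox", "sat"], ["a", "cat", "ran"]]

vocab1 = vocab_with_min_count(data1, min_count=1)
missing = out_of_vocab(data2, vocab1)
print(missing)  # words a second train() call on data2 would skip
```

If `missing` is not empty, a plain `model.train(data2, ...)` will not learn vectors for those words.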
Note that the weights computed by
model.train(data1, total_examples=len(data1), epochs=epochs)
followed by
model.train(data2, total_examples=len(data2), epochs=epochs)
are not equal to the weights computed by
model.train(data1+data2, total_examples=len(data1+data2), epochs=epochs)
Do I need to store the learned weights somewhere?
No, you don't need to.
But if you want, you can save the weights to a file so you can use them later:
model.save("word2vec.model")
Then load them with:
model = Word2Vec.load("word2vec.model")
(source)
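I need to fine-tune my word2vec model.
Note that "Word2vec training is an unsupervised task, there's no good way to objectively evaluate the result. Evaluation depends on your end application." (source) But you can find some evaluation approaches here (section "How to measure quality of the word vectors").
Hope that helps!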
Answer 1: (score: 2)
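Note that you don't need to call train() with data1 if you already provided data1 when the model was instantiated. The model will already have done its own internal build_vocab() and train() on the supplied corpus, using the default number of epochs (5) if you didn't specify one in the instantiation.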
"Fine-tuning" is not a simple process with reliable steps that are guaranteed to improve your model. It is very error-prone.
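In particular, if words in data2 aren't already known to the model, they'll be ignored. (There is an option to call build_vocab() with the parameter update=True to expand the known vocabulary, but such words aren't really on a fully equal footing with the earlier words.)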
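If data2 includes some words but not others, only the words in data2 get updated by the additional training, which may essentially pull those words out of comparable alignment with other words that only appeared in data1. (Only words trained together, in an interleaved shared training session, go through the "push-pull" that in the end leaves them in useful arrangements.)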
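The safest course for incremental training would be to shuffle data1 and data2 together and do the continued training on all the data, so that all the words get new interleaved training together.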
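The shuffling step can be sketched in plain Python. `shuffled_union` is a hypothetical helper name and the fixed seed is only there to make the example repeatable. It assumes both datasets are lists of tokenized sentences that fit in memory.

```python
import random


def shuffled_union(data1, data2, seed=0):
    """Concatenate two lists of sentences and shuffle them together."""
    combined = list(data1) + list(data2)
    random.Random(seed).shuffle(combined)
    return combined


# Toy tokenized corpora (illustrative only).
data1 = [["cat", "sat"], ["dog", "ran"]]
data2 = [["fox", "hid"], ["cat", "ran"]]

corpus = shuffled_union(data1, data2)
# The combined corpus would then be used for training, for example:
# model.train(corpus, total_examples=len(corpus), epochs=model.epochs)
```

Because every sentence from both datasets ends up in one interleaved corpus, words from data1 and data2 are updated in the same training passes.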
Answer 2: (score: 1)
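When you train a w2v model using gensim, it stores the vocab and the index of each word. gensim uses this information to map a word to its vector.
If you are going to fine-tune an already existing w2v model, you need to make sure your vocabulary is consistent.
See the attached code.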
import os
import pickle
import numpy as np
import gensim
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
import operator
# Create the output directory; do not fail if it already exists.
os.makedirs("model_dir", exist_ok=True)
# class EpochSaver(CallbackAny2Vec):
#     '''Callback to save model after each epoch.'''
#     def __init__(self, path_prefix):
#         self.path_prefix = path_prefix
#         self.epoch = 0
#     def on_epoch_end(self, model):
#         list_of_existing_files = os.listdir(".")
#         output_path = 'model_dir/{}_epoch{}.model'.format(self.path_prefix, self.epoch)
#         try:
#             model.save(output_path)
#         except:
#             model.wv.save_word2vec_format('model_dir/model_{}.bin'.format(self.epoch), binary=True)
#         print("number of epochs completed = {}".format(self.epoch))
#         self.epoch += 1
#         list_of_total_files = os.listdir(".")
# saver = EpochSaver("my_finetuned")
# Function to load vectors from an existing model.
# I am loading GloVe vectors from a text file; the benefit of doing this is that I get the complete GloVe vocab as well.
# If you are using a previous word2vec model, I would recommend saving it in txt format.
# In case you decide not to do that, you can tweak the function to get vectors for the words in your vocab only.
def load_vectors(token2id, path, limit=None):
    embed_shape = (len(token2id), 300)
    vectors = np.zeros(embed_shape, dtype='f')
    i = 0
    with open(path, encoding="utf8", errors='ignore') as f:
        for o in f:
            # Skip short lines (for example a header or a truncated row).
            if len(o) <= 100:
                continue
            # Stop once `limit` vectors have been loaded.
            if limit is not None and i >= limit:
                break
            # Strip the trailing newline before splitting into token and values.
            token, *vector = o.rstrip().split(' ')
            token = str.lower(token)
            vectors[token2id[token]] = np.array(vector, 'f')
            i += 1
    return vectors
embedding_name = "glove.840B.300d.txt"
data = "<training data (newline-separated text file)>"
# Dictionary to store a unique id for each token in the vocab (in my case the vocab contains both my vocab and the GloVe vocab).
token2id = {}
# This dictionary will contain all the words and their frequencies.
vocab_freq_dict = {}
# Populate vocab_freq_dict and token2id from my data.
id_ = 0
training_examples = []
with open(data, 'r', encoding="utf-8") as file:
    for line in file:
        words = line.strip().split(" ")
        training_examples.append(words)
        for word in words:
            # Count every occurrence of the word.
            vocab_freq_dict[word] = vocab_freq_dict.get(word, 0) + 1
            # Assign the next free id the first time a word is seen.
            if word not in token2id:
                token2id[word] = id_
                id_ += 1
# Populate vocab_freq_dict and token2id from the GloVe vocab.
# Find the token that currently has the highest id, then continue numbering from there.
max_token = max(token2id.items(), key=operator.itemgetter(1))[0]
max_token_id = token2id[max_token]
with open(embedding_name, encoding="utf8", errors='ignore') as f:
    for o in f:
        token, *vector = o.split(' ')
        token = str.lower(token)
        # Skip short lines, matching the filter in load_vectors.
        if len(o) <= 100:
            continue
        if token not in token2id:
            max_token_id += 1
            token2id[token] = max_token_id
            # GloVe-only words get a nominal frequency of 1.
            vocab_freq_dict[token] = 1
with open("vocab_freq_dict", "wb") as vocab_file:
    pickle.dump(vocab_freq_dict, vocab_file)
with open("token2id", "wb") as token2id_file:
    pickle.dump(token2id, token2id_file)
# Convert the vectors to the KeyedVectors format for gensim.
vectors = load_vectors(token2id, embedding_name)
vec = KeyedVectors(300)
vec.add(list(token2id.keys()), vectors, replace=True)
# Set vectors (numpy array) to None to release memory.
vectors = None
# Note: iter, size, index2entity and trainables are gensim 3.x names.
params = dict(min_count=1, workers=14, iter=6, size=300)
model = Word2Vec(**params)
# Use build_vocab_from_freq to build the vocab.
model.build_vocab_from_freq(vocab_freq_dict)
# Use token2id to map the model's word order to rows of the loaded vectors.
idxmap = np.array([token2id[w] for w in model.wv.index2entity])
# Set the input weights (syn0, between input layer and hidden layer) to your vectors, arranged according to ids.
model.wv.vectors[:] = vec.vectors[idxmap]
# Set the output weights (syn1neg, between hidden layer and output layer) to your vectors, arranged according to ids.
model.trainables.syn1neg[:] = vec.vectors[idxmap]
model.train(training_examples, total_examples=len(training_examples), epochs=model.epochs)
output_path = 'model_dir/final_model.model'
model.save(output_path)
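The idxmap step in the code above is what keeps the vocabulary consistent, so here it is in isolation. This is a toy sketch with made-up 2-dimensional vectors, assuming only numpy. `model_index` is a stand-in for the model's own word order (`model.wv.index2entity` in the code above), which generally differs from the order in `token2id`.

```python
import numpy as np

# Pretrained vectors, stored in the row order given by token2id.
token2id = {"cat": 0, "dog": 1, "fox": 2}
pretrained = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], dtype="f")

# Order in which the model indexes its words (assumed here for illustration).
model_index = ["fox", "cat", "dog"]

# For each model row, look up the matching row of the pretrained matrix.
idxmap = np.array([token2id[w] for w in model_index])
aligned = pretrained[idxmap]

print(aligned)  # row i now holds the vector of model_index[i]
```

Without this reordering, the vectors would be copied row by row and end up attached to the wrong words.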
Leave a comment if you have any questions.