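I am training document embeddings on about 20 million sentences with gensim, using parallel processing. I use the following code to create and train my model: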
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class read_corpus(object):
    def __init__(self, fname, n):
        self.fname = fname
        self.n = n

    def __iter__(self):
        num_notes = 0
        with open(self.fname, 'r') as f:
            # read at most self.n lines (assumes the file has at least that many)
            while num_notes < self.n:
                note = next(f)
                sentence_id, sentence = note.split('\t')
                # remove the newline character after each line and split into words
                sentence = sentence[:-1].split(' ')
                # some processing
                yield TaggedDocument(sentence, [sentence_id])
                num_notes += 1

def model(fname, vector_size, min_count,
          n_epochs, model_name,
          n, prev_model_name=None):
    data = read_corpus(fname, n)
    if prev_model_name is not None:
        # continue training a previously saved model
        model = Doc2Vec.load(prev_model_name)
    else:
        model = Doc2Vec(vector_size=vector_size,
                        min_count=min_count,
                        workers=4,
                        window=8,
                        alpha=0.1,
                        min_alpha=0.0001)
        model.build_vocab(data)

    model.train(data, total_examples=model.corpus_count, epochs=n_epochs)
    model.save(model_name)
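One detail of the reader worth ruling out: it calls next(f) directly, and at end of file that raises StopIteration, which Python 3.7+ turns into a RuntimeError when it escapes a generator. The following is only a minimal sketch of an alternative reader that ends cleanly when the file has fewer than n lines; it is not a confirmed fix for the hang. The names LimitedCorpus, TaggedDoc and write_demo_file are hypothetical, and TaggedDoc is a stand-in namedtuple with the same fields as gensim's TaggedDocument so that the sketch runs without gensim installed.

```python
from collections import namedtuple
from itertools import islice
import os
import tempfile

# Stand-in with the same field names as gensim's TaggedDocument (assumption:
# used here only so the sketch runs without gensim installed).
TaggedDoc = namedtuple("TaggedDoc", "words tags")


class LimitedCorpus:
    """Restartable iterable over at most n tab-separated lines of a file."""

    def __init__(self, fname, n):
        self.fname = fname
        self.n = n

    def __iter__(self):
        # Reopen the file on every pass so each epoch starts from the top.
        with open(self.fname, "r") as f:
            # islice stops after n lines, or earlier at end of file.
            for line in islice(f, self.n):
                sentence_id, sentence = line.rstrip("\n").split("\t", 1)
                yield TaggedDoc(sentence.split(" "), [sentence_id])


def write_demo_file():
    """Write a tiny two-line corpus to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(fd, "w") as f:
        f.write("s1\thello world\n")
        f.write("s2\tfoo bar baz\n")
    return path


if __name__ == "__main__":
    path = write_demo_file()
    corpus = LimitedCorpus(path, n=5)  # n is larger than the file on purpose
    for doc in corpus:
        print(doc.tags, doc.words)
    os.remove(path)
```

With the real TaggedDocument swapped in for the stand-in, such an object could be passed to build_vocab and train in the same way as read_corpus.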
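After 6 to 8 epochs, the logging output shows that training gets stuck waiting for worker threads. Note: the log always says "EPOCH 1" because I am running the training inside a for loop.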
...
INFO : EPOCH 1 - PROGRESS: at 99.71% examples, 162493 words/s, in_qsize 8, out_qsize 0
INFO : EPOCH 1 - PROGRESS: at 99.81% examples, 162528 words/s, in_qsize 7, out_qsize 0
INFO : EPOCH 1 - PROGRESS: at 99.91% examples, 162560 words/s, in_qsize 7, out_qsize 0
INFO : worker thread finished; awaiting finish of 3 more threads
INFO : worker thread finished; awaiting finish of 2 more threads
It has been stuck here for several hours.
My previous runs produced similar output, but there the logging stopped at INFO : worker thread finished; awaiting finish of 3 more threads