So far I have only managed to produce models that cannot be read by a person. I need to save the model as plain text in order to use it with certain software that requires the model in that format.
I tried the following:
model = models.doc2vec.Doc2Vec(size=300, min_count=0, alpha=0.025, min_alpha=0.025)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
model.save('mymodel.txt')
But I get:
Process finished with exit code -1073741571 (0xC00000FD)
I do not know if I should pass a specific parameter.
Answer 0: (score: 1)
The native gensim save() has no plain-text option: it uses Python core functionality such as object-pickling, and writes large raw floating-point arrays to secondary files with extra .npy extensions. Such files will contain raw binary data, and merely choosing a .txt filename has no effect on what is written.
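For example, a minimal sketch of the native round-trip (the filename here is just illustrative, and the extension makes no difference to the format):

# Native save()/load() preserve the full model, but in pickled/binary form;
# large arrays may also land in separate mymodel.model.*.npy files.
model.save('mymodel.model')
model2 = models.doc2vec.Doc2Vec.load('mymodel.model')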
You can save just the word-vectors into the one-vector-per-line, plain-text format used by the original Google word2vec.c by using the alternate method save_word2vec_format(). Also, recent versions of gensim Doc2Vec add an optional doctag_vec option to this method. If you supply doctag_vec=True, the doctag vectors will also be saved to the file, with their tag-names distinguished from word-vectors by an extra prefix. See the method's doc-comment and source-code for more info.
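For instance, a hedged sketch of that call (the exact keyword arguments may vary between gensim versions, so check the docs for your version; 'vectors.txt' is just an example filename):

# Write word-vectors (and doctag-vectors) as plain text, one vector per line.
model.save_word2vec_format(
    'vectors.txt',
    doctag_vec=True,   # also write doctag vectors, tag-names get a distinguishing prefix
    binary=False,      # plain text rather than the word2vec.c binary format
)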
However, no variant of save_word2vec_format() saves the entire model, with the internal model-weights and the vocabulary/doctag information (like relative frequencies) that are necessary for continued training. For that, you must use the native save(). If you need the full Doc2Vec model in a text format, you'll need to write that save code yourself, perhaps using the above method as a partial guide. (Additionally, I'm not aware of a preexisting convention for representing a whole model as text, so you'd have to find or devise one yourself, to match whatever will later load the full model.)
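As a starting point, here is a rough, hand-rolled dump of just the vector arrays in that one-vector-per-line layout. It assumes gensim 3.x-era attributes (model.wv.index2word, model.docvecs) and string doctags, and it does not capture the rest of the model state:

# Hypothetical plain-text dump of word-vectors and doctag-vectors only.
# Extra model state (vocabulary counts, internal weights) would still need
# its own representation if your target software requires it.
with open('full_vectors.txt', 'w') as out:
    for word in model.wv.index2word:
        vec = model.wv[word]
        out.write(word + ' ' + ' '.join('%.6f' % v for v in vec) + '\n')
    for tag in model.docvecs.doctags:
        vec = model.docvecs[tag]
        out.write('*dt_' + str(tag) + ' ' + ' '.join('%.6f' % v for v in vec) + '\n')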
Separately, regarding your Doc2Vec initialization parameters:
- a min_count=0 is usually a bad idea: rare words make models worse, so the default of min_count=5 usually improves models, and as your corpus gets larger, even larger min_count values, which discard more low-frequency words, tend to help model quality (as well as speeding training and shrinking the model's RAM/save sizes).
- a min_alpha equal to alpha is usually a bad idea, and means that train() no longer performs the linear decay of the alpha learning-rate that is the usual and effective way of doing stochastic-gradient-descent optimization of the model; a more conventional initialization is sketched below.
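For example, a hedged sketch of initialization with more conventional values (the numbers are only illustrative, and parameter names differ slightly across gensim versions, e.g. newer releases use vector_size and epochs):

# Illustrative values only; tune min_count and the number of passes for your corpus.
model = models.doc2vec.Doc2Vec(
    size=300,          # called 'vector_size' in newer gensim versions
    min_count=5,       # drop very-rare words instead of min_count=0
    alpha=0.025,
    min_alpha=0.0001,  # allow the learning-rate to decay during training
)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)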