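How can I use my own word2vec model, created in gensim, in spaCy?

Time: 2018-05-22 11:32:20

Tags: model word2vec gensim spacy

I have trained my own word2vec model in gensim and I am trying to load it in spaCy. As I understand it, I first need to save the model to disk and then load it into spaCy with init-model, but I can't work out exactly how.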

gensimmodel
Out[252]:
<gensim.models.word2vec.Word2Vec at 0x110b24b70>

import spacy
spacy.load(gensimmodel)

OSError: [E050] Can't find model 'Word2Vec(vocab=250, size=1000, alpha=0.025)'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

2 answers:

Answer 0: (score: 9)

Train the model and save its vectors in plain-text format:

import os

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

os.makedirs("./data", exist_ok=True)  # the output directory must exist before saving

# gensim 3.x API; in gensim 4.x the `size` argument is named `vector_size`
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("./data/word2vec.txt")
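
For reference, the file written by save_word2vec_format is plain text: a header line holding the vocabulary size and the vector dimensionality, followed by one line per word with its space-separated components. The sketch below reads that layout using only the standard library, which can help when checking a file before handing it to spaCy. The helper name read_word2vec_text and the sample data are illustrative assumptions, not part of gensim or spaCy.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec text format: header 'N D', then 'word v1 ... vD' per line."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Hypothetical two-word, three-dimensional sample in the same layout.
sample = io.StringIO("2 3\nhuman 0.1 0.2 0.3\ncomputer 0.4 0.5 0.6\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
```

For a real file, pass an open file object instead of the in-memory sample, for example open("./data/word2vec.txt", encoding="utf-8").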

Compress the text file with gzip:

gzip ./data/word2vec.txt

This produces the file ./data/word2vec.txt.gz.
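
If the gzip command-line tool is not available (on Windows, for example), roughly the same step can be sketched with Python's standard library. The helper name gzip_file is a made-up illustration, and unlike the gzip command this version leaves the uncompressed file in place.

```python
import gzip
import shutil


def gzip_file(src_path):
    """Write a gzip-compressed copy of src_path next to it and return its path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the bytes without loading the whole file
    return dst_path
```

Calling gzip_file("./data/word2vec.txt") would then write ./data/word2vec.txt.gz.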

Then build a spaCy model from the compressed vectors by running:

python -m spacy init-model en ./data/spacy.word2vec.model --vectors-loc ./data/word2vec.txt.gz

Load the vectors with:

nlp = spacy.load('en', vectors='./data/spacy.word2vec.model/')

Answer 1: (score: 4)

As described here, you can import custom word vectors trained with Gensim, FastText, or Tomas Mikolov's original word2vec implementation by creating a model with:

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
python -m spacy init-model en your_model --vectors-loc cc.la.300.vec.gz

Then you can load the model from the resulting your_model directory with spacy.load and use it!

See also the answer to a similar question here.