How to build a memory-friendly embedding matrix for TensorFlow Serving?

Asked: 2018-09-11 13:09:51

Tags: tensorflow keras gensim tensorflow-serving

I converted my Keras model to a *.pb file, but I found that the embedding matrix only contains the words that actually appear in my data (this saves memory and speeds up the initialization of the embedding matrix).
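For context, the export is the usual TF 1.x SavedModel export, roughly like the sketch below (a minimal sketch, assuming a single-input/single-output Keras model; `model` and `export_dir` are placeholders, not my exact code):

    import tensorflow as tf
    from keras import backend as K

    # 'model' is assumed to be the trained Keras model from earlier in the pipeline.
    # Write a SavedModel (saved_model.pb plus variables) for TensorFlow Serving.
    tf.saved_model.simple_save(
        K.get_session(),
        export_dir='./serving_model/1',          # placeholder path
        inputs={'input': model.input},
        outputs={'output': model.output},
    )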

In production I need every word to be loadable. Is there a way that both keeps memory usage down and makes the embedding matrix contain all the words?
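Concretely, by "contain all words" I mean something like the sketch below, which keeps the PAD/UNK rows and then one row per pre-trained word (assuming `word2vec_model` is the loaded gensim KeyedVectors from the code further down); the problem is that materializing it this way costs one row of floats per word in the full vocabulary:

    import numpy as np

    # Naive full-vocabulary matrix: rows 0/1 reserved for PAD/UNK, then one row
    # per pre-trained word, in the order gensim stores them.
    full_vocab_size = len(word2vec_model.vocab) + 2
    full_matrix = np.zeros([full_vocab_size, word2vec_model.vector_size],
                           dtype='float32')
    full_matrix[2:] = word2vec_model.vectors  # millions of rows -> several GB in RAM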

Here is the code that loads the pre-trained word vectors:

import numpy as np


def assign_pretrained_word_embedding(vocabulary_index2word, vocab_size, word2vec_model_path, text_file=False):
    # Import here so that users who never touch word2vec do not need gensim installed.
    from gensim.models import KeyedVectors
    print("using pre-trained word embedding. started. word2vec_model_path:", word2vec_model_path)
    if not text_file:
        word2vec_model = KeyedVectors.load(word2vec_model_path, mmap='r')
    else:
        word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=False)

    embedding_dim = word2vec_model.vector_size  # 300 for the models used here
    word_embedding = np.zeros([vocab_size, embedding_dim], dtype='float32')
    bound = np.sqrt(6.0) / np.sqrt(vocab_size)  # bound for the random init of missing words
    count_exist = 0
    count_not_exist = 0
    # Rows 0 and 1 are reserved for the PAD and UNK tokens and stay all-zero.
    for i in range(2, vocab_size):
        word = vocabulary_index2word[i]
        if word in word2vec_model.vocab:  # in gensim >= 4.0 this check is word2vec_model.key_to_index
            word_embedding[i] = word2vec_model[word]  # copy the pre-trained vector
            count_exist += 1
        else:
            # No pre-trained vector for this word: initialize it with small random values.
            word_embedding[i] = np.random.uniform(-bound, bound, embedding_dim)
            count_not_exist += 1
    print("words with a pre-trained embedding:", count_exist, "; words without one:", count_not_exist)
    print("using pre-trained word embedding. ended...")
    return word_embedding
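For reference, the returned matrix initializes the model's Embedding layer roughly like this (a sketch; the surrounding model definition is omitted and the layer arguments are assumptions):

    from keras.layers import Embedding

    # Indices in the Embedding layer must match vocabulary_index2word
    # (0 = PAD, 1 = UNK).
    embedding_matrix = assign_pretrained_word_embedding(
        vocabulary_index2word, vocab_size, word2vec_model_path)
    embedding_layer = Embedding(input_dim=vocab_size,
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                trainable=False)  # freeze, or True to fine-tune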

0 Answers:

No answers yet.