Words missing from a trained word2vec model's vocabulary

Date: 2019-05-08 04:40:58

Tags: python tensorflow nltk gensim word2vec

I am currently working in Python, where I train a Word2Vec model on a set of provided sentences. I then save and load the model in order to obtain the word embedding of each word in the sentences that were used to train the model. However, I get the following error.

KeyError: "word 'n1985_chicago_bears' not in vocabulary"

Yet one of the sentences provided during training is the following:

sportsteam n1985_chicago_bears teamplaysincity city chicago

So I would like to know why some words are missing from the vocabulary even though the model was trained on sentences from this corpus that contain them.

Training a word2vec model on my own corpus

import nltk
from gensim.models import Word2Vec


# PREPARING DATA

fname = '../data/sentences.txt'

with open(fname) as f:
    content = f.readlines()

# remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]


# TOKENIZING SENTENCES

sentences = []

for x in content:
    nltk_tokens = nltk.word_tokenize(x)
    sentences.append(nltk_tokens)

# TRAINING THE WORD2VEC MODEL

model = Word2Vec(sentences)

# the words the model kept in its vocabulary
words = list(model.wv.vocab)

# save in binary format to match the binary=True used when loading
model.wv.save_word2vec_format('model.bin', binary=True)
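Right after training, it is worth checking whether a given token actually made it into the vocabulary before saving anything. A minimal sketch using the model object above (gensim 3.x exposes the vocabulary as the dict model.wv.vocab, the same attribute the code already reads):

# check membership before doing a lookup, to avoid a hard KeyError
token = 'n1985_chicago_bears'
if token in model.wv.vocab:
    print(model.wv[token])
else:
    print(token, 'was dropped from the vocabulary during training')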

Sample of sentences from sentences.txt

sportsteam hawks teamplaysincity city atlanta
stadiumoreventvenue honda_center stadiumlocatedincity city anaheim
sportsteam ducks teamplaysincity city anaheim
sportsteam n1985_chicago_bears teamplaysincity city chicago
stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta
stadiumoreventvenue united_center stadiumlocatedincity city chicago
...

There are 1860 such lines in the sentences.txt file, each containing exactly 5 words and no stop words.

After saving the model, I try to load it from another Python file located in the same directory as the saved model.bin, as shown below.

Loading the saved model.bin

from gensim import models

# load the vectors saved above in binary word2vec format
w = models.KeyedVectors.load_word2vec_format('model.bin', binary=True)
print(w['n1985_chicago_bears'])

However, I end up with the following error:

KeyError: "word 'n1985_chicago_bears' not in vocabulary"
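To narrow down whether the word was lost at training time or at save/load time, the loaded vocabulary can be inspected directly. A minimal sketch against the w object loaded above, assuming the gensim 3.x KeyedVectors API:

# how many tokens survived training, and a few of them
print(len(w.vocab))
print(sorted(w.vocab)[:10])

# any saved token containing 'bears'
print([t for t in w.vocab if 'bears' in t])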

Is there a way to obtain, with this same approach, the word embedding of every word in the corpus of sentences the model was trained on?

Any suggestions in this regard would be greatly appreciated.

1 Answer:

Answer 0 (score: 2)

Gensim's Word2Vec implementation defaults to min_count=5. It looks like the token n1985_chicago_bears you are looking for occurs fewer than 5 times in your corpus, so it is discarded when the vocabulary is built. Change your min_count appropriately.
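This is easy to verify by counting token frequencies before training. A minimal sketch, assuming the sentences list of tokenized lines built in the question:

from collections import Counter

# frequency of every token across the tokenized training sentences
freq = Counter(token for sent in sentences for token in sent)
print(freq['n1985_chicago_bears'])

# tokens below the default min_count=5 threshold, which get discarded
print(len([t for t, n in freq.items() if n < 5]))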

Method signature:

class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None,
    size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None,
    sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0,
    negative=5, ns_exponent=0.75, cbow_mean=1,
    hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None,
    sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(),
    max_final_vocab=None)

import nltk
from gensim.models import Word2Vec

content = [
    "sportsteam hawks teamplaysincity city atlanta",
    "stadiumoreventvenue honda_center stadiumlocatedincity city anaheim",
    "sportsteam ducks teamplaysincity city anaheim",
    "sportsteam n1985_chicago_bears teamplaysincity city chicago",
    "stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta",
    "stadiumoreventvenue united_center stadiumlocatedincity city chicago"
]

sentences = []

for x in content:
    nltk_tokens = nltk.word_tokenize(x)
    sentences.append(nltk_tokens)

# min_count=1 keeps every token, even those that occur only once
model = Word2Vec(sentences, min_count=1)
print(model.wv['n1985_chicago_bears'])
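If the retrained model is then saved and reloaded as in the question, the binary flag has to match on both sides. A minimal sketch reusing the model.bin file name from the question:

from gensim.models import KeyedVectors

# save and reload with matching binary flags
model.wv.save_word2vec_format('model.bin', binary=True)
w = KeyedVectors.load_word2vec_format('model.bin', binary=True)
print(w['n1985_chicago_bears'])

Keep in mind that min_count=1 retains every token, but vectors for words seen only a handful of times are generally of low quality; whether that trade-off is acceptable depends on how the embeddings are used downstream.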