I am running the code below, but gensim word2vec raises a "word not in vocabulary" error. Can you tell me how to fix it?
This is my file (file.txt):
'intrepid', 'bumbling', 'duo', 'deliver', 'good', 'one', 'better', 'offering', 'considerable', 'cv', 'freshly', 'qualified', 'private', ..
This is my code:
import gensim
with open('file.txt', 'r') as myfile:
    data = myfile.read()
model = gensim.models.Word2Vec(data, min_count=1, size=32)
w1 = "good"
model.wv.most_similar(positive=w1)
Output:
KeyError: "word 'good' not in vocabulary"
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-34-22572d5a8082> in <module>()
7 model = gensim.models.Word2Vec(data,min_count=1,size=32)
8 w1 = "good"
----> 9 model.wv.most_similar (positive=w1)
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
529 mean.append(weight * word)
530 else:
--> 531 mean.append(weight * self.word_vec(word, use_norm=True))
532 if word in self.vocab:
533 all_words.add(self.vocab[word].index)
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py in word_vec(self, word, use_norm)
450 return result
451 else:
--> 452 raise KeyError("word '%s' not in vocabulary" % word)
453
454 def get_vector(self, word):
KeyError: "word 'good' not in vocabulary"
Answer 0 (score: 1)
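import gensim

data = []
with open('lastlast.txt', 'r') as myfile:
    raw_data = myfile.read()

# Treat newlines as separators too, then split into individual entries
raw_data = raw_data.replace('\n', ',')
split_data = raw_data.split(',')
# Strip the quotes and spaces around each word and drop blank entries
data = [i.replace("'", '').replace(' ', '') for i in split_data if i.strip()]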
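The first argument should be an iterable of sentences, where each sentence is itself a list of words. If you pass data directly, each element of data is treated as a sentence and then iterated again, so the model ends up learning single characters instead of words. Wrapping it as [data] gives one sentence whose tokens are whole words. From the docs: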
>>> model = gensim.models.Word2Vec([data],min_count=1,size=32)
>>> model = Word2Vec.load("word2vec.model")
>>> model.train([["hello", "world"]], total_examples=1, epochs=1)
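To see why the shape of the first argument matters, here is a minimal pure-Python sketch. It is not gensim's actual implementation; build_vocab is a hypothetical helper that only mimics the idea of iterating over sentences and then over the tokens of each sentence.

```python
from collections import Counter


def build_vocab(sentences):
    """Count tokens the way a sentence-iterating model would see them.

    Hypothetical helper for illustration only (not part of gensim):
    the outer iterable yields sentences, and each sentence is iterated
    again to yield its tokens.
    """
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            counts[token] += 1
    return counts


raw = "'good', 'one', 'better'"      # what myfile.read() returns: one string
words = ["good", "one", "better"]    # the cleaned list of words

vocab_from_string = build_vocab(raw)      # every "sentence" is a single character
vocab_from_words = build_vocab(words)     # every "sentence" is a word, tokens are letters
vocab_from_nested = build_vocab([words])  # one sentence whose tokens are whole words

print("good" in vocab_from_string)  # False
print("good" in vocab_from_words)   # False
print("good" in vocab_from_nested)  # True
```

Only the nested form puts the whole word 'good' into the vocabulary, which is why the original call fails with the KeyError.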
Your solution: with the model trained on [data], the following query now returns a result.
>>> model.wv.most_similar(['good'])