I am trying to use gensim word2vec, but I cannot train a model on the Brown corpus. Here is my code.
from gensim import models
model = models.Word2Vec([sentence for sentence in models.word2vec.BrownCorpus("E:\\nltk_data\\")],workers=4)
model.save("E:\\data.bin")
I downloaded nltk_data using nltk.download(). I get the following error.
C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py:401: UserWarning: Cython compilation failed, training will be slow. Do you have Cython installed? `pip install cython`
warnings.warn("Cython compilation failed, training will be slow. Do you have Cython installed? `pip install cython`")
Traceback (most recent call last):
File "E:\eclipse_workspace\Python_files\Test\Test.py", line 8, in <module>
model = models.Word2Vec([sentence for sentence in models.word2vec.BrownCorpus("E:\\nltk_data\\")],workers=4)
File "C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py", line 276, in __init__
self.train(sentences)
File "C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py", line 407, in train
raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
What am I doing wrong?
Answer 0: (score: 10)
Maybe you are building the sentences the wrong way. Try this; it works for me.
import gensim
import logging
from nltk.corpus import brown
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = brown.sents()
model = gensim.models.Word2Vec(sentences, min_count=1)
model.save('/tmp/brown_model')
The logging part is not required, and you can change the parameters of Word2Vec() as needed.
Answer 1: (score: 2)
You need the full path to the corpus directory, not just the nltk_data directory. On my system that would be:
from os.path import expanduser, join
from gensim.models.word2vec import BrownCorpus, Word2Vec
dirname = expanduser(join('~', 'nltk_data', 'corpora', 'brown'))
model = Word2Vec(BrownCorpus(dirname))
model.similar_by_word('house/nn')
This gives:
[(u'room/nn', 0.9538693428039551), (u'door/nn', 0.9475813508033752), ...
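A likely reason the original call fails: as far as I can tell, BrownCorpus only reads regular files sitting directly inside the directory it is given and does not descend into subdirectories. The top-level nltk_data folder holds only subfolders, so no sentences come out and the vocabulary stays empty. Here is a minimal stdlib-only sketch for checking a directory before training. The helper name list_corpus_files is made up for illustration and is not part of gensim.

```python
import os


def list_corpus_files(dirname):
    """Return the regular files directly inside dirname (no recursion).

    Hypothetical helper: it mirrors the assumption that BrownCorpus
    skips anything in the directory that is not a plain file.
    """
    return sorted(
        name for name in os.listdir(dirname)
        if os.path.isfile(os.path.join(dirname, name))
    )
```

If this returns an empty list for the path you pass to BrownCorpus, the iterator will most likely yield nothing, and you should point it at the folder that actually contains the corpus files.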
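Note that the Brown Corpus in NLTK comes with POS tags. Gensim's BrownCorpus class skips non-alphabetic tokens such as punctuation but keeps the POS tags. With nltk.corpus.brown.sents() you get the sentences without POS tags.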
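To make the difference concrete, here is a rough stdlib-only sketch of the kind of filtering that produces tokens like house/nn. It is an approximation of what I understand BrownCorpus to do (lowercase the word, keep the first two characters of the tag, drop tokens whose tag is not alphabetic), not gensim's actual code. Both function names are hypothetical.

```python
def brown_line_to_tokens(line):
    """Turn one 'word/TAG word/TAG ...' line into 'word/tg' tokens.

    Approximation only: keeps tokens of the form word/TAG, lowercases
    the word, truncates the tag to two characters, and drops tokens
    whose truncated tag is not alphabetic (punctuation and similar).
    """
    pairs = [t.split('/') for t in line.split() if len(t.split('/')) == 2]
    return ["%s/%s" % (word.lower(), tag[:2])
            for word, tag in pairs if tag[:2].isalpha()]


def strip_tags(tokens):
    """Remove the trailing '/tag' part from each token."""
    return [t.rsplit('/', 1)[0] for t in tokens]


tagged = brown_line_to_tokens("The/at house/nn was/bedz empty/jj ./.")
plain = strip_tags(tagged)
```

Here tagged holds the/at, house/nn, was/be and empty/jj, and plain holds the bare lowercase words. This is not identical to brown.sents(), which keeps punctuation and the original capitalization.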