Question

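I am trying to fit a Word2Vec model. According to the Gensim Word2Vec documentation, we do not need to call model.build_vocab before using it when the corpus is passed to the constructor. Yet it is asking me to do so. I tried calling this function, but it did not work. I have also fitted a Word2Vec model before without needing to call model.build_vocab.

Am I doing something wrong? Here is my code: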

import pandas as pd
from gensim.models import Word2Vec
dataset = pd.read_table('genemap_copy.txt',delimiter='\t', lineterminator='\n')

def row_to_sentences(dataframe):
    columns = dataframe.columns.values
    corpus = []
    for index,row in dataframe.iterrows():
        if index == 1000:
            break
        sentence = ''
        for column in columns:
            sentence += ' '+str(row[column])
        corpus.append([sentence])
    return corpus

corpus = row_to_sentences(dataset)
clean_corpus = [[sentence[0].lower()] for sentence in corpus ]


# model = Word2Vec()
# model.build_vocab(clean_corpus)
model = Word2Vec(clean_corpus, size=100, window=5, min_count=5, workers=4)

Any help is much appreciated! I am also using macOS Sierra. There is not much support for using Gensim on a Mac.

Answer 1

Try LineSentence:

from gensim.models.word2vec import LineSentence

Then train on your corpus with:

model = Word2Vec(LineSentence(clean_corpus), size=100, window=5, min_count=5, workers=4)
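One caveat worth noting: LineSentence reads from a text file (or file-like object) holding one sentence per line with whitespace-separated tokens, so an in-memory list such as clean_corpus would likely need to be written to disk first. The following is a minimal sketch under that assumption; the helper names, the sample rows, and the file name corpus.txt are hypothetical, and the training call uses the gensim 4.x parameter name vector_size (gensim 3.x called it size).

```python
import os
import tempfile


def write_line_sentence_file(rows, path):
    """Write one lowercased, whitespace-separated sentence per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            # Collapse internal whitespace and newlines so each row stays on one line.
            handle.write(" ".join(str(row).lower().split()) + "\n")
    return path


def train_from_file(path):
    """Train Word2Vec on a one-sentence-per-line file (hypothetical helper)."""
    # Imported here so the file-writing helper works without gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # gensim 4.x uses `vector_size`; gensim 3.x used `size` for the same setting.
    return Word2Vec(LineSentence(path), vector_size=100, window=5,
                    min_count=5, workers=4)


# Hypothetical rows standing in for the question's data.
rows = ["GeneA  chr1 kinase", "GeneB chr2 kinase"]
corpus_path = write_line_sentence_file(
    rows, os.path.join(tempfile.mkdtemp(), "corpus.txt")
)
```

Calling train_from_file(corpus_path) on a realistically sized corpus would then train the model; with only two short rows and min_count=5, no word would pass the frequency threshold.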

Answer 2

I think my problem was the parameter min_count=5: most of my words were being ignored because they occurred fewer than 5 times.
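To see why min_count can leave the vocabulary empty, it may help to count tokens by hand. The sketch below does not call Gensim at all; surviving_vocab is a hypothetical helper that only mimics the frequency threshold (keep tokens seen at least min_count times), and the rows are made-up sample data. When every "sentence" is a single untokenized string, as in the question's corpus, each token is unique and none reaches the threshold.

```python
from collections import Counter


def surviving_vocab(sentences, min_count):
    """Return the tokens that occur at least `min_count` times across all sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


# Hypothetical sample rows.
rows = ["gene a", "gene b", "gene c", "gene d", "gene e"]

# Shape used in the question: each sentence is a list holding one whole-row string.
one_token_per_row = [[row] for row in rows]

# Shape Word2Vec expects: each sentence is a list of word tokens.
tokenized = [row.split() for row in rows]

print(surviving_vocab(one_token_per_row, 5))  # set()
print(surviving_vocab(tokenized, 5))          # {'gene'}
```

An empty result in the first case corresponds to a model with no vocabulary, which is consistent with the "build vocabulary" error in the title.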

Answer 3

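Are you appending a new list containing a single sentence each time? See corpus.append([sentence]). Word2Vec needs an iterable of sentences, where each sentence is a list of tokens, not a list containing one untokenized string; the sentences do not have to be grouped by document. I am also unclear about what is in your df, but have you already tokenized the sentences?

I have used a generator class with Word2Vec before...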
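As an illustration of the point above, here is a minimal sketch of building the corpus as lists of tokens rather than single strings. The function name rows_to_token_lists and the sample rows are hypothetical, and splitting on whitespace is only an assumed tokenization; real data may need something more careful.

```python
def rows_to_token_lists(rows, limit=1000):
    """Turn each row (an iterable of cell values) into one list of lowercase tokens."""
    corpus = []
    for index, row in enumerate(rows):
        if index == limit:
            break
        tokens = []
        for cell in row:
            # Split every cell on whitespace so each word becomes its own token.
            tokens.extend(str(cell).lower().split())
        corpus.append(tokens)
    return corpus


# Hypothetical rows standing in for the DataFrame in the question.
rows = [
    ("GeneA", "chr1", "Protein kinase"),
    ("GeneB", "chr2", "Protein kinase"),
]
clean_corpus = rows_to_token_lists(rows)
print(clean_corpus)
# [['genea', 'chr1', 'protein', 'kinase'], ['geneb', 'chr2', 'protein', 'kinase']]
```

With a pandas DataFrame, the rows could be supplied as dataset.itertuples(index=False), and the resulting clean_corpus passed straight to Word2Vec.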

import gensim
from nltk.tokenize import sent_tokenize
from gensim.utils import simple_preprocess

class MySentences(object):
    """Restartable iterable that yields one tokenized sentence at a time."""
    def __init__(self, docs):
        self.corpus = docs
    def __iter__(self):
        for doc in self.corpus:
            doc_sentences = sent_tokenize(doc)
            for sent in doc_sentences:
                # yields a tokenized sentence, e.g. ['like', 'this', 'one']
                yield simple_preprocess(sent)

# df is assumed to be a DataFrame with a 'text' column of raw documents
sentences = MySentences(df['text'].tolist())
# In gensim 4.x the size parameter is named vector_size
model = gensim.models.Word2Vec(sentences, min_count=5, workers=8, size=300, sg=1)

Gensim Word2Vec: 'you must first build vocabulary before training the model'
