I'm trying to use the word2vec module from gensim, the natural language processing library for Python.
The documentation says to initialize the model:
from gensim.models import word2vec
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
What format does gensim expect for the input sentences? I have raw text like:
"the quick brown fox jumps over the lazy dogs"
"Then a cop quizzed Mick Jagger's ex-wives briefly."
etc.
What additional processing do I need to do before feeding this into word2vec?
UPDATE: Here is what I've tried. When it loads the sentences, I get nothing back.
>>> sentences = ['the quick brown fox jumps over the lazy dogs',
"Then a cop quizzed Mick Jagger's ex-wives briefly."]
>>> x = word2vec.Word2Vec()
>>> x.build_vocab([s.encode('utf-8').split() for s in sentences])
>>> x.vocab
{}
Answer 0 (score: 10):
A list of utf-8 sentences. You can also stream the data from disk.
Make sure it's utf-8, and split it:
sentences = [ "the quick brown fox jumps over the lazy dogs",
"Then a cop quizzed Mick Jagger's ex-wives briefly." ]
word2vec.Word2Vec([s.encode('utf-8').split() for s in sentences], size=100, window=5, min_count=5, workers=4)
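A side note that may explain the empty vocabulary in the question (a sketch of mine, not part of this answer): Word2Vec's default min_count=5 discards every word seen fewer than five times, and in a two-sentence toy corpus no word reaches that threshold. Lowering min_count while experimenting keeps the words (this assumes a gensim version contemporary with this answer, where size= and model.wv.vocab are still valid):

from gensim.models import word2vec

sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]
tokenized = [s.lower().split() for s in sentences]  # one list of words per sentence

# min_count=1 keeps every word; the default of 5 would discard this entire tiny vocabulary
model = word2vec.Word2Vec(tokenized, size=100, window=5, min_count=1, workers=4)
print(len(model.wv.vocab))  # number of words that made it into the vocabulary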
Answer 1 (score: 2):
As alKid mentioned, make it utf-8.
Two other things you might have to worry about: reading the input from disk when it's too large to hold in memory, and removing stop words from the sentences; the iterator below handles both.
Instead of loading a big list into memory, you can do something like this:
import nltk, gensim
class FileToSent(object):
    """Iterate over a file on disk, yielding one tokenized sentence per line."""

    def __init__(self, filename):
        self.filename = filename
        # English stop words to drop from every sentence
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        for line in open(self.filename, 'r'):
            # Python 2: decode to unicode, lowercase, split on whitespace, drop stop words
            ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
            yield ll
Then,
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
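One caveat: unicode() exists only on Python 2, so the iterator above will not run on Python 3. A minimal Python 3 sketch of the same idea (assuming the same sentence_file.txt, and that nltk's stopwords corpus has been downloaded via nltk.download('stopwords')):

import nltk, gensim

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        # open() decodes each line with the given encoding, so no unicode() call is needed
        with open(self.filename, 'r', encoding='utf-8') as f:
            for line in f:
                yield [w for w in line.lower().split() if w not in self.stop]

sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)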