Question

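I am trying to use word2vec for supervised classification of labelled classes. I have data consisting of sentences with labels. I want to train a word2vec model on it and then use word2vec (or perhaps a random forest classifier) to predict the class of unseen sentences. This is what I have tried so far: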

import gensim
import numpy as np

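class TokenizedSentence(object):
    """Iterable that yields each document as a list of lowercase tokens."""

    def __init__(self, doc_list):
        self.doc_list = doc_list

    def __iter__(self):
        for t in self.doc_list:
            # Only byte strings need decoding; a Python 3 str has no .decode().
            if isinstance(t, bytes):
                t = t.decode('utf-8')
            yield gensim.utils.simple_preprocess(t)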



tweets = ["a tweet", "another tweet", ... , "some tweet"]
labels = [1, 1, ... , 16]

training_data = TokenizedSentence(tweets)

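# Passing the corpus to the constructor already builds the vocabulary and trains.
# Note: gensim 4.x renames `size` to `vector_size`.
model = gensim.models.Word2Vec(training_data, size=150, window=10, min_count=2, workers=10)
# Additional training passes over the same corpus.
model.train(training_data, total_examples=model.corpus_count, epochs=200)
model.save("w2vmodel")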

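I would like to know how to include the label that comes with each sentence, so that I can tell the model which label a sentence belongs to, then load the model in a separate file and classify unseen data with supervised learning. Any help is appreciated!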
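For context, word2vec training itself never sees the labels. One commonly used approach is to turn each sentence into a fixed-length feature vector by averaging its word vectors, and then fit a separate classifier on those features together with `labels`. Below is a minimal sketch of that idea, not a verified solution for this data. The helper names `sentence_vector`, `build_features` and `train_classifier` are hypothetical, `word_vectors` is assumed to be any mapping that supports `in` and `[]` (such as `model.wv`), and the random forest settings are arbitrary.

```python
import numpy as np


def sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have one; all zeros if none do."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # Every token was out of vocabulary (e.g. dropped by min_count).
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


def build_features(token_lists, word_vectors, dim):
    """Stack one averaged vector per sentence into an (n_sentences, dim) array."""
    return np.vstack([sentence_vector(toks, word_vectors, dim)
                      for toks in token_lists])


def train_classifier(features, labels):
    """Fit a random forest on the sentence features.

    scikit-learn is imported here so the helpers above only need NumPy.
    """
    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(features, labels)
    return clf
```

With the code from the question, this would presumably be used as `X = build_features(list(training_data), model.wv, model.wv.vector_size)` followed by `clf = train_classifier(X, labels)`. Unseen sentences would go through the same `simple_preprocess` and `build_features` steps before calling `clf.predict`. The classifier has to be saved separately from the word2vec model (for example with `joblib`), since `model.save` only stores the embeddings.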

Using word2vec for supervised classification

0 answers: