Sklearn classifier can't be trained on Gensim Word2Vec data

Time: 2019-08-22 21:05:49

Tags: python machine-learning scikit-learn gensim word2vec

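I am building a program that assigns multiple labels/tags to text descriptions. I use Scikit-Learn's OneVsRestClassifier with XGBClassifier to classify the vectorized descriptions, and Gensim's Word2Vec to vectorize the text. However, when I try to fit the classifier to the vectorized data, I get the following error: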

  

IndexError: tuple index out of range

Below is my code (the error occurs on the last line, where I try to fit the classifier):

w2vModel = Word2Vec(sentences, size=150, window=10, min_count=2, workers=multiprocessing.cpu_count())
modelCorpus = list(w2vModel.wv.vocab)

descriptions = []
for sentence in sentences:
    wordList = []
    for word in sentence: 
        if (word in modelCorpus):
            wordList.append(w2vModel.wv[word])
    descriptions.append(np.concatenate(wordList))

x = np.array(descriptions)

# Vectorize ticket labels/tags using MultiLabelBinarizer
tagList = relevantDF.Tags # Retrieve list of tags
vectorizer2 = MultiLabelBinarizer()
vectorizer2.fit(tagList)
y = vectorizer2.transform(tagList)

# Split test data and convert test data to arrays
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.20)
yTrain = csr_matrix(yTrain).toarray()

# Fit OneVsRestClassifier w/ XGBClassifier
clf = OneVsRestClassifier(XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.003))
clf.fit(xTrain, yTrain)

Shape of x: (8347,)

Shape of y: (8347, 24)

Shape of xTrain: (6677,)

Shape of yTrain: (6677, 24)

1 Answer:

Answer 0 (score: 0)

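My guess is that your samples do not all have the same number of features. A longer sentence contributes more word vectors to its description than a shorter one, so the concatenated vectors differ in length. That would also explain why x has shape (8347,) instead of a 2-D shape. Instead of concatenating the word vectors, you can average them: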

descriptions.append(np.mean( np.array(wordList), axis=0 ))
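To illustrate the difference, here is a minimal sketch that uses random arrays as stand-ins for the `w2vModel.wv[word]` lookups, so it runs without Gensim. The helper name `average_vectors` and the zero-vector fallback for sentences with no in-vocabulary words are assumptions added for illustration; they are not part of the original code.

```python
import numpy as np

def average_vectors(word_vectors, dim):
    """Collapse a sentence's word vectors into one vector of length `dim`."""
    if len(word_vectors) == 0:
        # Assumption: a sentence with no in-vocabulary words maps to zeros.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(np.asarray(word_vectors), axis=0)

rng = np.random.default_rng(0)
dim = 150  # matches size=150 in the question

# Stand-ins for the per-word vectors of three sentences with
# 4, 9 and 0 in-vocabulary words respectively.
sentence_vectors = [
    rng.normal(size=(n, dim)).astype(np.float32) for n in (4, 9, 0)
]

# Concatenation: the length depends on the number of words per sentence.
concatenated_lengths = [
    np.concatenate(list(v)).shape[0] if len(v) else 0
    for v in sentence_vectors
]

# Averaging: every sentence yields exactly `dim` features,
# so the rows stack into a regular 2-D matrix.
x = np.vstack([average_vectors(v, dim) for v in sentence_vectors])

print(concatenated_lengths)
print(x.shape)
```

With averaging, `x` is a regular `(n_samples, 150)` matrix, which is the 2-D layout that scikit-learn estimators expect for `fit`.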

Alternatively, take a look at Doc2Vec, which produces a single vector for each sentence.