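I am building a program that assigns multiple labels/tags to text descriptions. I am using Scikit-Learn's OneVsRestClassifier with XGBClassifier to classify the vectorized descriptions, and Gensim's Word2Vec to vectorize the text. However, when I try to fit the classifier to the vectorized data, I get the following error:
IndexError: tuple index out of range
Here is my code (the error occurs on the last line, where I try to fit the classifier):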
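import multiprocessing
import numpy as np
from gensim.models import Word2Vec
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from xgboost import XGBClassifier

# Train Word2Vec on the tokenized sentences (gensim < 4.0 API: size=, wv.vocab)
w2vModel = Word2Vec(sentences, size=150, window=10, min_count=2, workers=multiprocessing.cpu_count())
modelCorpus = list(w2vModel.wv.vocab)
descriptions = []
for sentence in sentences:
    wordList = []
    for word in sentence:
        if (word in modelCorpus):
            wordList.append(w2vModel.wv[word])
    descriptions.append(np.concatenate(wordList))
x = np.array(descriptions)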
# Vectorize ticket labels/tags using MultiLabelBinarizer
tagList = relevantDF.Tags # Retrieve list of tags
vectorizer2 = MultiLabelBinarizer()
vectorizer2.fit(tagList)
y = vectorizer2.transform(tagList)
# Split test data and convert test data to arrays
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.20)
yTrain = csr_matrix(yTrain).toarray()
# Fit OneVsRestClassifier w/ XGBClassifier
clf = OneVsRestClassifier(XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.003))
clf.fit(xTrain, yTrain)
The shape of x is: (8347,)
The shape of y is: (8347, 24)
The shape of xTrain is: (6677,)
The shape of yTrain is: (6677, 24)
Answer 0 (score: 0)
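My guess is that your samples do not all have the same number of features. Longer sentences add more word vectors to a description than shorter ones, so the concatenated vectors have different lengths. That would explain why x has the shape (8347,) instead of a 2-D shape like (8347, n_features). Instead of concatenating the word vectors, you could average them, which gives every description the same 150 features: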
descriptions.append(np.mean( np.array(wordList), axis=0 ))
Alternatively, take a look at Doc2Vec, which produces a single vector for each sentence.
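To make the averaging suggestion concrete, here is a minimal sketch using only NumPy. The random vectors are stand-ins for the Word2Vec lookups, and the helper name average_vectors and its zero-vector fallback for sentences with no in-vocabulary words are my own assumptions, not something from the question:

```python
import numpy as np

def average_vectors(word_vectors, dim):
    """Return the mean of a list of word vectors (hypothetical helper)."""
    if not word_vectors:
        # Assumed fallback: a sentence with no known words becomes a zero vector
        return np.zeros(dim, dtype=np.float32)
    return np.mean(np.asarray(word_vectors), axis=0)

dim = 150  # matches size=150 in the question
rng = np.random.default_rng(0)

# Stand-ins for Word2Vec lookups: sentences with 3, 7 and 0 known words
sentences_as_vectors = [
    [rng.standard_normal(dim).astype(np.float32) for _ in range(n)]
    for n in (3, 7, 0)
]

# Concatenation gives a different length per sentence (the suspected problem)
concat_lengths = [len(np.concatenate(v)) if v else 0 for v in sentences_as_vectors]

# Averaging gives every sentence exactly `dim` features, so the rows stack into 2-D
x = np.vstack([average_vectors(v, dim) for v in sentences_as_vectors])

print(concat_lengths)
print(x.shape)
```

With averaging, x is a proper 2-D array with one row per description, which is the layout the classifier's fit method expects. Whether a zero vector is a sensible fallback for empty sentences depends on your data; dropping those rows (together with the matching rows of y) is another option.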