I am using word2vec for text classification. Here is my function.
def get_w2v_features(w2v_model, sentence_group):
    """ Transform a sentence_group (containing multiple lists
    of words) into a feature vector. It averages out all the
    word vectors of the sentence_group.
    """
    words = np.concatenate(sentence_group)  # words in text
    index2word_set = set(w2v_model.wv.vocab.keys())  # words known to model

    featureVec = np.zeros(w2v_model.vector_size, dtype="float32")

    # Initialize a counter for number of words in a review
    nwords = 0
    # Loop over each word in the comment and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            featureVec = np.add(featureVec, w2v_model[word])
            nwords += 1.

    # Divide the result by the number of words to get the average
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec
data['w2v_features'] = list(map(lambda sen_group:
                                get_w2v_features(W2Vmodel, sen_group),
                                data.tokenized_sentences))
The data looks like this, and the features are created without problems, as shown below.
[image: dataframe left side] [image: dataframe right side]
Now, when I run the same code on my test data, which has tokenized sentences in the same form as the training data, I get an error.
test_data['w2v_features'] = list(map(lambda sen_group:
                                     get_w2v_features(W2Vmodel, sen_group),
                                     test_data.tokenized_sentences))
ValueError                                Traceback (most recent call last)
<ipython-input-73-b32852b152cb> in <module>()
      1 test_data['w2v_features'] = list(map(lambda sen_group:
      2                                      get_w2v_features(W2Vmodel, sen_group),
----> 3                                      test_data.tokenized_sentences))

<ipython-input-73-b32852b152cb> in <lambda>(sen_group)
      1 test_data['w2v_features'] = list(map(lambda sen_group:
----> 2                                      get_w2v_features(W2Vmodel, sen_group),
      3                                      test_data.tokenized_sentences))

<ipython-input-34-cd5ee3d13b85> in get_w2v_features(w2v_model, sentence_group)
      4     word vectors of the sentence_group.
      5     """
----> 6     words = np.concatenate(sentence_group)  # words in text
      7     index2word_set = set(w2v_model.wv.vocab.keys())  # words known to model
      8

ValueError: need at least one array to concatenate
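
For context, np.concatenate raises this ValueError when it is given an empty sequence, so one plausible cause (an assumption, since the test data itself is not shown) is that at least one row of test_data.tokenized_sentences is an empty list. The sketch below reproduces the error in isolation, finds empty rows, and averages with a guard. The names toy_groups, toy_vectors and average_vectors are hypothetical stand-ins for the test column, the trained model's word lookup, and get_w2v_features; they are not part of the original code.

```python
import numpy as np

# np.concatenate needs at least one array; an empty group reproduces the error.
try:
    np.concatenate([])
except ValueError as err:
    print(err)

# Hypothetical stand-in for test_data.tokenized_sentences.
toy_groups = [
    [["good", "movie"], ["loved", "it"]],
    [],  # a row with no sentences at all
    [["bad"]],
]

# Positions of the rows that would trigger the error.
empty_rows = [i for i, group in enumerate(toy_groups) if len(group) == 0]


def average_vectors(vectors, vector_size, sentence_group):
    """Average the vectors of known words; return zeros for an empty group.

    `vectors` is a hypothetical dict-like mapping word -> 1-D array,
    standing in for the trained word2vec model's lookup.
    """
    feature_vec = np.zeros(vector_size, dtype="float32")
    if len(sentence_group) == 0:
        # Nothing to concatenate, so keep the all-zero vector.
        return feature_vec
    words = np.concatenate(sentence_group)
    known = [vectors[w] for w in words if w in vectors]
    if known:
        feature_vec = np.mean(known, axis=0).astype("float32")
    return feature_vec


# Hypothetical two-dimensional "embeddings" for illustration only.
toy_vectors = {
    "good": np.array([1.0, 0.0], dtype="float32"),
    "bad": np.array([0.0, 1.0], dtype="float32"),
}
features = [average_vectors(toy_vectors, 2, g) for g in toy_groups]
```

Returning a zero vector for an empty row is only one option; dropping such rows before building the features may suit a classifier better, depending on how the test set was produced.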