Question

I am using word2vec for text classification. Here is my function:

def get_w2v_features(w2v_model, sentence_group):
    """ Transform a sentence_group (containing multiple lists
    of words) into a feature vector. It averages out all the
    word vectors of the sentence_group.
    """
    words = np.concatenate(sentence_group)  # words in text
    index2word_set = set(w2v_model.wv.vocab.keys())  # words known to model

    featureVec = np.zeros(w2v_model.vector_size, dtype="float32")

    # Initialize a counter for number of words in a review
    nwords = 0
    # Loop over each word in the comment and, if it is in the model's vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            featureVec = np.add(featureVec, w2v_model[word])
            nwords += 1.

    # Divide the result by the number of words to get the average
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

data['w2v_features'] = list(map(lambda sen_group:
                                      get_w2v_features(W2Vmodel, sen_group),
                                      data.tokenized_sentences))

The data looks like this, and the features are created fine, as shown below.

[Images: dataframe left side, dataframe right side]

Now, when I run the same code on the test data, which contains tokenized sentences in the same format as the training data, I get an error.

test_data['w2v_features'] = list(map(lambda sen_group:
                                      get_w2v_features(W2Vmodel, sen_group),
                                      test_data.tokenized_sentences))

ValueError                                Traceback (most recent call last)
<ipython-input-73-b32852b152cb> in <module>()
      1 test_data['w2v_features'] = list(map(lambda sen_group:
      2                                       get_w2v_features(W2Vmodel, sen_group),
----> 3                                       test_data.tokenized_sentences))

<ipython-input-73-b32852b152cb> in <lambda>(sen_group)
      1 test_data['w2v_features'] = list(map(lambda sen_group:
----> 2                                       get_w2v_features(W2Vmodel, sen_group),
      3                                       test_data.tokenized_sentences))

<ipython-input-34-cd5ee3d13b85> in get_w2v_features(w2v_model, sentence_group)
      4     word vectors of the sentence_group.
      5     """
----> 6     words = np.concatenate(sentence_group)  # words in text
      7     index2word_set = set(w2v_model.wv.vocab.keys())  # words known to model
      8 

ValueError: need at least one array to concatenate

[Image: error when running the function on the test data]
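
A note on the traceback: np.concatenate raises "need at least one array to concatenate" when it is given an empty sequence, so a likely cause (an assumption, since the test data is not shown) is that at least one row of test_data.tokenized_sentences is an empty list. The sketch below shows one way to average word vectors without failing on such rows. The name average_word_vectors and the toy_vectors dictionary are hypothetical; the dictionary stands in for the trained model so the example runs without gensim.

```python
import numpy as np


def average_word_vectors(word_vectors, vector_size, sentence_group):
    """Average the vectors of all known words in sentence_group.

    word_vectors is any mapping from word to 1-D vector. It is a
    hypothetical stand-in for the trained model's word vectors.
    """
    # Flatten with a comprehension instead of np.concatenate, which raises
    # ValueError when sentence_group is empty.
    words = [word for sentence in sentence_group for word in sentence]

    feature_vec = np.zeros(vector_size, dtype="float32")
    n_words = 0
    for word in words:
        if word in word_vectors:  # skip out-of-vocabulary words
            feature_vec += word_vectors[word]
            n_words += 1

    # An empty or fully out-of-vocabulary group keeps the zero vector.
    if n_words > 0:
        feature_vec /= n_words
    return feature_vec


# Toy stand-in for the model, for illustration only.
toy_vectors = {
    "good": np.array([1.0, 0.0], dtype="float32"),
    "bad": np.array([0.0, 1.0], dtype="float32"),
}

empty_result = average_word_vectors(toy_vectors, 2, [])
mixed_result = average_word_vectors(toy_vectors, 2, [["good", "unknown"], ["bad"]])
```

Returning a zero vector for an empty group is a design choice, not the only option; dropping such rows before feature extraction may suit a classifier better. To check whether empty rows are actually present, something like test_data[test_data.tokenized_sentences.map(len) == 0] should list them, assuming every row holds a list.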
