Including categorical features alongside text in a word2vec approach - Python

Time: 2018-03-05 06:52:27

Tags: python word2vec categorical-data

My data is a mix of text and categorical features. It looks like this:

cr_id  description                   business      type        status
    1  More robust system required   secured loan  system      rejected
    2  More robust system required   secured loan  system      rejected
    3  grant access to all products  mortgage      system      rejected
    4  EDAP Scenario                 secured loan  regulatory  accepted
    5  grant access to all products  secured loan  regulatory  accepted

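I am applying word2vec to the description column, which gives me training vectors with 300 columns. My question is how to include the categorical features in the model. That is, how do I combine the word2vec output with one-hot encoded vectors, given that they have different shapes? The code used to build the word vectors is: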

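import os
import re

import numpy as np
from gensim.models import word2vec
from nltk.corpus import stopwords

# `data` (a pandas DataFrame) and `tokenizer` (an NLTK sentence tokenizer)
# are assumed to be defined earlier.

def review_to_wordlist(review, remove_stopwords=False):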
    """
    Convert a review to a list of words. Removal of stop words is optional.
    """
    # remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review)

    # convert to lower case and split at whitespace
    words = review_text.lower().split()

    # remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    return words

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    """
    Split review into list of sentences where each sentence is a list of words.
    Removal of stop words is optional.
    """
    # use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())

    # each sentence is furthermore split into words
    sentences = []    
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))

    return sentences

train_sentences = []  # Initialize an empty list of sentences
for review in data['description']:
    train_sentences += review_to_sentences(review, tokenizer)

model_name = 'train_model'
# Set values for various word2vec parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 3   # Minimum word count                        
num_workers = 3       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words
print(1)
if not os.path.exists(model_name): 
    # Initialize and train the model (this will take some time)
    model = word2vec.Word2Vec(train_sentences, workers=num_workers, \
                size=num_features, min_count = min_word_count, \
                window = context, sample = downsampling)
    print(10)
    # If you don't plan to train the model any further, calling 
    # init_sims will make the model much more memory-efficient.
    model.init_sims(replace=True)

    # It can be helpful to create a meaningful model name and 
    # save the model for later use. You can load it later using Word2Vec.load()
    model.save(model_name)
else:
    print(11)
    # the model was saved with model.save(), so load it the same way
    model = word2vec.Word2Vec.load(model_name)

def make_feature_vec(words, model, num_features):
    """
    Average the word vectors for a set of words
    """
    feature_vec = np.zeros((num_features,),dtype="float32")  # pre-initialize (for speed)
    nwords = 0
    index2word_set = set(model.wv.index2word)  # words known to the model

    for word in words:
        if word in index2word_set:
            nwords += 1
            feature_vec = np.add(feature_vec, model.wv[word])

    # avoid dividing by zero (NaN rows) when no word is in the vocabulary
    if nwords > 0:
        feature_vec = np.divide(feature_vec, nwords)
    return feature_vec


def get_avg_feature_vecs(reviews, model, num_features):
    """
    Calculate average feature vectors for all reviews
    """
    counter = 0
    review_feature_vecs = np.zeros((len(reviews),num_features), dtype='float32')  # pre-initialize (for speed)

    for review in reviews:
        review_feature_vecs[counter] = make_feature_vec(review, model, num_features)
        counter = counter + 1
    return review_feature_vecs

clean_train_reviews = []
for review in data['description']:
    clean_train_reviews.append(review_to_wordlist(review, remove_stopwords=True))
trainDataVecs = get_avg_feature_vecs(clean_train_reviews, model, num_features)
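
The averaging step above can be sketched in isolation. This is a minimal illustration, not the asker's exact code: `average_vector` and `toy_vectors` are hypothetical names, and a plain dict with made-up 3-dimensional values stands in for the trained word2vec model. It returns a zero vector when no word is known, which avoids NaN rows in the feature matrix.

```python
import numpy as np

def average_vector(words, vectors, num_features):
    """Average the vectors of the words found in `vectors`.

    `vectors` is any mapping from word to 1-D array (a plain dict here,
    standing in for a trained word2vec model). Returns a zero vector when
    none of the words are known, instead of dividing by zero.
    """
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0).astype("float32")

# Toy 3-dimensional "embeddings" (made-up values, for illustration only)
toy_vectors = {
    "secured": np.array([1.0, 0.0, 2.0]),
    "loan": np.array([3.0, 2.0, 0.0]),
}

avg = average_vector(["secured", "loan", "unknown"], toy_vectors, 3)
empty = average_vector(["unknown"], toy_vectors, 3)
```

Here `avg` is the mean of the two known vectors, and `empty` is all zeros because none of its words appear in the mapping.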

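Now I can feed this training data to any classifier to predict status. As mentioned above, I am confused about how to include/append the categorical features, such as type, to trainDataVecs, since they have different shapes.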

Edit: as requested, to make the problem clearer, trainDataVecs looks like this:

array([[-0.02591809, -0.04678563, -0.0401891 , ..., -0.00907444,
        -0.02070936, -0.02332937],
       [ 0.00098296, -0.00293253,  0.04667222, ...,  0.00685261,
        -0.01234391, -0.03822058],
       [ 0.01843361, -0.01345504, -0.01359649, ..., -0.04710409,
         0.04892955,  0.02135875],
       ...,
       [ 0.0304883 ,  0.08515919,  0.01928426, ..., -0.00903708,
        -0.00333895,  0.07550056],
       [ 0.01843361, -0.01345504, -0.01359649, ..., -0.04710409,
         0.04892955,  0.02135875],
       [ 0.01843361, -0.01345504, -0.01359649, ..., -0.04710409,
         0.04892955,  0.02135875]], dtype=float32)

Its shape is (40, 300), and the input training data has shape (40, 5). The 300 columns come from the num_features parameter.
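
On the shape concern: two arrays can be joined column-wise as long as they have the same number of rows; the column counts may differ. A small sketch with random stand-in data (the names `text_features`, `codes` and `one_hot` are illustrative, and the 3-level categorical column is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for trainDataVecs: 40 rows, 300 averaged word2vec features
text_features = rng.normal(size=(40, 300)).astype("float32")

# Hypothetical integer codes for one categorical column with 3 levels
codes = rng.integers(0, 3, size=40)
one_hot = np.eye(3, dtype="float32")[codes]  # shape (40, 3)

# Same number of rows, so the columns can simply be placed side by side
combined = np.concatenate((text_features, one_hot), axis=1)  # shape (40, 303)
```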

1 Answer:

Answer 0: (score: 0)

It looks like trainDataVecs is just a numpy matrix. You can generate dummies and try to concatenate them:

  1. Create the dummies

  2. Convert the dummies to a numpy matrix

  3. Concatenate them column-wise (axis=1)

dummies = pd.get_dummies(df['business type']).iloc[:, 1:]  # take n-1 columns
dummies = dummies.values  # DataFrame -> numpy array
X = np.concatenate((trainDataVecs, dummies), axis=1)

    Note:

    There may be dimension issues. Please post what trainDataVecs looks like in that case and I can update the answer to reflect it.
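
The steps in this answer can be sketched end to end on the five sample rows from the question. This is a rough illustration under stated assumptions: it assumes `business` and `type` are separate columns (each sample row has five values), uses random numbers as a stand-in for trainDataVecs, and the names `train_vecs`, `X` and `y` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's table
data = pd.DataFrame({
    "cr_id": [1, 2, 3, 4, 5],
    "description": [
        "More robust system required",
        "More robust system required",
        "grant access to all products",
        "EDAP Scenario",
        "grant access to all products",
    ],
    "business": ["secured loan", "secured loan", "mortgage",
                 "secured loan", "secured loan"],
    "type": ["system", "system", "system", "regulatory", "regulatory"],
    "status": ["rejected", "rejected", "rejected", "accepted", "accepted"],
})

# Stand-in for trainDataVecs: random values, one 300-d row per record
train_vecs = np.random.default_rng(0).normal(size=(len(data), 300)).astype("float32")

# One-hot encode both categorical columns, dropping one level per column
dummies = pd.get_dummies(data[["business", "type"]], drop_first=True)

# Place the text features and the dummy columns side by side
X = np.hstack([train_vecs, dummies.to_numpy(dtype="float32")])

# Binary target: 1 for accepted, 0 for rejected
y = (data["status"] == "accepted").to_numpy().astype(int)
```

`X` and `y` can then be passed to a classifier. `drop_first=True` drops one level per categorical column, which is the multi-column equivalent of the `.iloc[:, 1:]` trick used for a single column.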