My data has a mix of text and categorical features. It looks like this:
cr_id  description                   business      type        status
1      More robust system required   secured loan  system      rejected
2      More robust system required   secured loan  system      rejected
3      grant access to all products  mortgage      system      rejected
4      EDAP Scenario                 secured loan  regulatory  accepted
5      grant access to all products  secured loan  regulatory  accepted
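I am applying word2vec to the description column, which gives me training vectors with 300 columns. My question is how to include the categorical features in the model. In other words, how do I combine the word2vec output with one-hot encoded vectors, given that they have different shapes? The code used to build the word vectors is: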
import os
import re

import numpy as np
from gensim.models import word2vec, KeyedVectors
from nltk.corpus import stopwords

def review_to_wordlist(review, remove_stopwords=False):
    """
    Convert a review to a list of words. Removal of stop words is optional.
    """
    # remove non-letters
    review_text = re.sub("[^a-zA-Z]", " ", review)
    # convert to lower case and split at whitespace
    words = review_text.lower().split()
    # remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    return words
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    """
    Split review into list of sentences where each sentence is a list of words.
    Removal of stop words is optional.
    """
    # use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    # each sentence is furthermore split into words
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    return sentences
# `tokenizer` is an NLTK sentence tokenizer created earlier (not shown here)
train_sentences = []  # Initialize an empty list of sentences
for review in data['description']:
    train_sentences += review_to_sentences(review, tokenizer)
model_name = 'train_model'
# Set values for various word2vec parameters
num_features = 300    # Word vector dimensionality
min_word_count = 3    # Minimum word count
num_workers = 3       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words
print(1)
if not os.path.exists(model_name):
    # Initialize and train the model (this will take some time)
    # Note: `size` is the gensim < 4.0 parameter name
    model = word2vec.Word2Vec(train_sentences, workers=num_workers,
                              size=num_features, min_count=min_word_count,
                              window=context, sample=downsampling)
    print(10)
    # If you don't plan to train the model any further, calling
    # init_sims will make the model much more memory-efficient.
    model.init_sims(replace=True)
    # It can be helpful to create a meaningful model name and
    # save the model for later use. You can load it later using Word2Vec.load()
    model.save(model_name)
else:
    print(11)
    # model = Word2Vec.load(model_name)
    model = KeyedVectors.load(model_name)
def make_feature_vec(words, model, num_features):
    """
    Average the word vectors for a set of words
    """
    feature_vec = np.zeros((num_features,), dtype="float32")  # pre-initialize (for speed)
    nwords = 0
    index2word_set = set(model.wv.index2word)  # words known to the model
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            feature_vec = np.add(feature_vec, model.wv[word])
    # avoid dividing by zero when no word is in the vocabulary
    if nwords > 0:
        feature_vec = np.divide(feature_vec, nwords)
    return feature_vec
def get_avg_feature_vecs(reviews, model, num_features):
    """
    Calculate average feature vectors for all reviews
    """
    counter = 0
    review_feature_vecs = np.zeros((len(reviews), num_features), dtype='float32')  # pre-initialize (for speed)
    for review in reviews:
        review_feature_vecs[counter] = make_feature_vec(review, model, num_features)
        counter = counter + 1
    return review_feature_vecs
clean_train_reviews = []
for review in data['description']:
    clean_train_reviews.append(review_to_wordlist(review, remove_stopwords=True))
trainDataVecs = get_avg_feature_vecs(clean_train_reviews, model, num_features)
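I can now feed these training vectors to any classifier to predict status. As mentioned above, I am confused about how to include/append categorical features such as type to trainDataVecs, since they have different shapes.
Edit: To make the problem clearer, trainDataVecs looks like this: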
array([[-0.02591809, -0.04678563, -0.0401891 , ..., -0.00907444,
-0.02070936, -0.02332937],
[ 0.00098296, -0.00293253, 0.04667222, ..., 0.00685261,
-0.01234391, -0.03822058],
[ 0.01843361, -0.01345504, -0.01359649, ..., -0.04710409,
0.04892955, 0.02135875],
...,
[ 0.0304883 , 0.08515919, 0.01928426, ..., -0.00903708,
-0.00333895, 0.07550056],
[ 0.01843361, -0.01345504, -0.01359649, ..., -0.04710409,
0.04892955, 0.02135875],
[ 0.01843361, -0.01345504, -0.01359649, ..., -0.04710409,
0.04892955, 0.02135875]], dtype=float32)
Its shape is (40, 300), and the shape of the input training data is (40, 5). The 300 columns come from the num_features parameter.
Answer 0 (score: 0)
It looks like trainDataVecs is just a numpy matrix. You can generate the dummies and try to concatenate them.
dummies = pd.get_dummies(df['business type']).iloc[:, 1:]  # take n-1 columns
dummies = dummies.values  # as_matrix() was removed in pandas 1.0; .values works in all versions
X = np.concatenate((trainDataVecs, dummies), axis=1)
The steps are: dummify the column, convert the result to a numpy matrix, and concatenate by column (axis=1).
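As a minimal, self-contained sketch of the same idea, the snippet below uses a random stand-in for trainDataVecs and a toy DataFrame. The helper name `combine_features`, the column names `business` and `type`, and the toy values are assumptions for illustration only, not the real data set.

```python
import numpy as np
import pandas as pd


def combine_features(text_vecs, frame, cat_columns):
    """Append one-hot encoded categorical columns to averaged word vectors.

    text_vecs   : array of shape (n_rows, n_features), e.g. trainDataVecs
    frame       : DataFrame whose rows are in the same order as text_vecs
    cat_columns : names of the categorical columns to encode (hypothetical)
    """
    # drop_first=True keeps n-1 dummy columns per feature, like .iloc[:, 1:]
    dummies = pd.get_dummies(frame[cat_columns], drop_first=True)
    # Cast to the dtype of the word vectors so the result stays homogeneous
    dummy_matrix = dummies.to_numpy().astype(text_vecs.dtype)
    return np.concatenate((text_vecs, dummy_matrix), axis=1)


# Toy stand-ins (assumed values, not the real data)
toy = pd.DataFrame({
    "business": ["secured loan", "secured loan", "mortgage", "secured loan"],
    "type": ["system", "system", "system", "regulatory"],
})
fake_vecs = np.random.rand(len(toy), 300).astype("float32")

X = combine_features(fake_vecs, toy, ["business", "type"])
print(X.shape)  # (4, 302): 300 word-vector columns plus 2 dummy columns
```

The combined matrix X can then be passed to a classifier in place of trainDataVecs, with status as the target.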
Note:
There may be dimension issues. If so, please post what trainDataVecs looks like in that case and I can update the answer to reflect it.
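One way to catch such dimension problems early is to compare the two blocks before joining them, since np.concatenate with axis=1 needs two 2-D arrays with the same number of rows. The helper name `safe_concat` below is hypothetical, and the zero/one arrays are placeholders rather than real features.

```python
import numpy as np


def safe_concat(text_vecs, dummy_matrix):
    """Join two feature blocks column-wise, with a readable error on mismatch."""
    text_vecs = np.asarray(text_vecs)
    dummy_matrix = np.asarray(dummy_matrix)
    # Both blocks must be 2-D: (n_rows, n_columns)
    if text_vecs.ndim != 2 or dummy_matrix.ndim != 2:
        raise ValueError(
            "both inputs must be 2-D, got shapes %s and %s"
            % (text_vecs.shape, dummy_matrix.shape)
        )
    # Row counts must agree, otherwise rows no longer describe the same records
    if text_vecs.shape[0] != dummy_matrix.shape[0]:
        raise ValueError(
            "row count mismatch: %d vs %d"
            % (text_vecs.shape[0], dummy_matrix.shape[0])
        )
    return np.concatenate((text_vecs, dummy_matrix), axis=1)


# Placeholder arrays with the shapes mentioned in the question
combined = safe_concat(np.zeros((40, 300)), np.ones((40, 2)))
print(combined.shape)  # (40, 302)
```

A row count mismatch usually means rows were dropped or filtered in one of the two inputs but not the other, so it is worth checking that both were built from the same DataFrame.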