How to build a model using GloVe word embeddings and predict on test data in R

Time: 2018-03-05 22:23:19

Tags: r word2vec text-classification word-embedding text2vec

I am building a text classification model with GloVe word embeddings to classify text data into two classes (i.e., classify each comment into one of two classes). I have two columns: one contains the text data (the comments) and the other is a binary target variable (whether the comment is actionable). I was able to generate GloVe word embeddings for the text data with the following code from the text2vec documentation.

glove_model <- GlobalVectors$new(word_vectors_size = 50,
                                 vocabulary = glove_pruned_vocab,
                                 x_max = 20L)
# Fit the model and get the word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20,
                                               convergence_tol = -1)
word_vectors_context <- glove_model$components
# Sum the main and context vectors into the final embeddings
word_vectors <- word_vectors_main + t(word_vectors_context)
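
The snippet above assumes `glove_pruned_vocab` and `glove_tcm` already exist. A minimal sketch of how they can be built with text2vec (the variable names, `comments` column, and thresholds here are assumptions, following the text2vec vignette):

library(text2vec)

# Tokenize the comment column (assumed to be called `comments`)
tokens <- word_tokenizer(tolower(comments))
it <- itoken(tokens, progressbar = FALSE)

# Build and prune the vocabulary, then the term-co-occurrence matrix
vocab <- create_vocabulary(it)
glove_pruned_vocab <- prune_vocabulary(vocab, term_count_min = 5L)
vectorizer <- vocab_vectorizer(glove_pruned_vocab)
glove_tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)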

How do I build a model on top of these embeddings and generate predictions for test data?

2 Answers:

Answer 0 (score: 1)

text2vec has a standard `predict` method (as most R libraries do), so you can use it directly: see the documentation.

In short, just use standard R `predict()`.
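
The answer doesn't show code, so as one possible route (an assumption, not the answerer's method): average the GloVe vectors of each comment into a document vector and fit a regularized logistic regression with glmnet, whose fitted object supports `predict()`:

library(glmnet)

# Document-term matrix from the same it/vectorizer used for the TCM (sketch;
# assumes dtm columns and word_vectors rows share the same vocabulary order)
dtm <- create_dtm(it, vectorizer)
# Average the word vectors per comment -> one fixed-length vector per document
doc_vectors <- as.matrix(dtm %*% word_vectors) / pmax(Matrix::rowSums(dtm), 1)

# Cross-validated logistic regression on the binary target
fit <- cv.glmnet(doc_vectors, target, family = "binomial")
# Standard predict(); doc_vectors_test is a hypothetical test-set matrix
probs <- predict(fit, newx = doc_vectors_test, type = "response")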

Answer 1 (score: 0)

Figured it out.

glove_model <- GlobalVectors$new(word_vectors_size = 50,
                                 vocabulary = glove_pruned_vocab,
                                 x_max = 20L)
# Fit the model and get the word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20,
                                               convergence_tol = -1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main + t(word_vectors_context)

After creating the word embeddings, build an index that maps each word (a string) to its vector representation (a numeric vector):

embeddings_index <- new.env(parent = emptyenv())
# `lines` holds the lines of a GloVe-format text file: a word followed by its coefficients
for (line in lines) {
  values <- strsplit(line, ' ', fixed = TRUE)[[1]]
  word <- values[[1]]
  coefs <- as.numeric(values[-1])
  embeddings_index[[word]] <- coefs
}
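
In this thread the embeddings come from text2vec rather than a file, so instead of parsing lines, the same index could be filled straight from the `word_vectors` matrix above, whose rownames are the vocabulary terms (a sketch, not part of the original answer):

embeddings_index <- new.env(parent = emptyenv())
# Each row of word_vectors is one word's embedding
for (word in rownames(word_vectors)) {
  embeddings_index[[word]] <- word_vectors[word, ]
}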

Next, build an embedding matrix of shape (max_words, embedding_dim) that can be loaded into an embedding layer.

embedding_dim <- 50  # number of dimensions used to represent each word
embedding_matrix <- array(0, c(max_words, embedding_dim))
for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_words) {
    embedding_vector <- embeddings_index[[word]]
    if (!is.null(embedding_vector)) {
      # Words not found in the embedding index will all be zeros
      embedding_matrix[index + 1, ] <- embedding_vector
    }
  }
}
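
The loop above relies on `word_index`, which the answer doesn't define. A sketch of the usual way to obtain it with keras's text tokenizer (`texts`, `maxlen`, and the variable names are assumptions):

library(keras)

# Fit a tokenizer on the raw comments and turn them into padded integer sequences
tokenizer <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(texts)
word_index <- tokenizer$word_index
sequences <- texts_to_sequences(tokenizer, texts)
x_train <- pad_sequences(sequences, maxlen = maxlen)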
We can then load this embedding matrix into the embedding layer, build a model, and generate predictions.

model_pretrained <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = embedding_dim,
                  input_length = maxlen) %>%  # input_length (assumed maxlen) lets layer_flatten() know the sequence length
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
summary(model_pretrained)

# Load the GloVe embeddings into the first (embedding) layer and freeze it
get_layer(model_pretrained, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()

model_pretrained %>% compile(optimizer = "rmsprop",
                             loss = "binary_crossentropy",
                             metrics = c("accuracy"))

history <- model_pretrained %>% fit(x_train, y_train,
                                    validation_data = list(x_val, y_val),
                                    epochs = num_epochs, batch_size = 32)

Then generate predictions using the standard `predict` function.
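
For example (a sketch; `x_test` is a hypothetical test matrix preprocessed with the same tokenizer and `maxlen` as the training data):

preds <- model_pretrained %>% predict(x_test)  # sigmoid probabilities in [0, 1]
pred_class <- as.integer(preds > 0.5)          # hard class labels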

Check the following links. Use word embeddings to build a model in Keras

Pre-trained word embeddings