Classifying text with a neural network in R, using my own word embeddings

Asked: 2019-03-22 10:37:39

Tags: r neural-network nlp word-embedding

This is a fairly long one, so please bear with me. Unfortunately, the error happens right at the end... I can't make predictions on the unseen test set!

I want to classify texts with a neural network that uses word embeddings trained on my own dataset. I use only a column of text descriptions as the input and four different price classes as the target.

For a reproducible example, here are the necessary dataset and word embeddings:

DF:https://www.dropbox.com/s/it0jsbv8e7nkryt/DF.csv?dl=0

WordEmb:https://www.dropbox.com/s/ia5fmio2e0plwkr/WordEmb.txt?dl=0

Here is my code:

library(keras)    # model, tokenizer and padding functions
library(stringi)  # string utilities

set.seed(2077)
DF <- read.delim("DF.csv", header = TRUE, sep = ",",
                 dec = ".", stringsAsFactors = FALSE)
DF <- DF[, -1]   # drop the row-number column
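
# quick sanity check of the two columns used below; Price_class is assumed
# to be coded 1-4 (the to_categorical() slicing further down relies on this)
str(DF[, c("Description", "Price_class")])
table(DF$Price_class)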

# parameters
max_num_words = 9000         # vocabulary cap, roughly the number of observations
validation_split = 0.3
embedding_dim = 300

##### Data Preparation #####

# split into training and test set

set.seed(2077)
n <- nrow(DF)
shuffled <- DF[sample(n),]

# Split the data in train and test
train <- shuffled[1:round(0.7 * n),]
test <- shuffled[(round(0.7 * n) + 1):n,]
rm(n, shuffled)

# predictor/target variable
x_train <- train$Description
x_test <- test$Description

y_train <- train$Price_class
y_test <- test$Price_class

### encode target variable ###

# one-hot encode the target values; Price_class is coded 1-4, so
# to_categorical() produces 5 columns (classes 0-4) and the unused
# 0 column is dropped

trainLabels <- to_categorical(y_train)
trainLabels <- trainLabels[, 2:5]

testLabels <- keras::to_categorical(y_test)
testLabels <- testLabels[, 2:5]
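
# each label matrix should now be n_samples x 4 (one column per price class)
dim(trainLabels)
dim(testLabels)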

### encode predictor variable ###

tokenizer <- text_tokenizer(num_words = max_num_words)

# vectorize the text samples into sequences of word indices;
# fit the tokenizer on the training texts only -- re-fitting it on the test
# texts would change the word index and leak test vocabulary into training

set.seed(2077)
tokenizer %>% fit_text_tokenizer(x_train)
train_data <- texts_to_sequences(tokenizer, x_train)
test_data <- texts_to_sequences(tokenizer, x_test)

# determine the average document length -> use it to choose the maximal sequence length
mean(lengths(train_data))

max_sequence_length = 70

# This turns our lists of integers into a 2D integer tensor of shape (samples, maxlen)

x_train <- keras::pad_sequences(train_data, maxlen = max_sequence_length)
x_test <- keras::pad_sequences(test_data, maxlen = max_sequence_length)

word_index <- tokenizer$word_index
Encoding(names(word_index)) <- "UTF-8"
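
# sanity checks: the padded matrices should be n_samples x 70, and the
# vocabulary size shows whether the max_num_words cap of 9000 is binding
dim(x_train)
length(word_index)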

#### PREPARE EMBEDDING MATRIX ####

# read the pre-trained vectors; each line of WordEmb.txt is expected to look
# like "word v1 v2 ... v300"
embeddings_index <- new.env(parent = emptyenv())
lines <- readLines("WordEmb.txt")
for (line in lines) {
  values <- strsplit(line, ' ', fixed = TRUE)[[1]]
  word <- values[[1]]
  coefs <- as.numeric(values[-1])
  embeddings_index[[word]] <- coefs
}
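
# how many pre-trained vectors were read in
length(ls(embeddings_index))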

# note: rows are shifted by one because R matrices are 1-based, while the
# tokenizer indices start at 1 and index 0 is reserved for padding
embedding_matrix <- array(0, c(max_num_words, embedding_dim))
for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_num_words) {
    embedding_vector <- embeddings_index[[word]]
    if (!is.null(embedding_vector)) {
      # words without a pre-trained vector stay all-zeros
      embedding_matrix[index + 1, ] <- embedding_vector
    }
  }
}
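
# optional coverage check: rows that stayed all-zero are vocabulary words
# without a pre-trained vector in WordEmb.txt
sum(rowSums(embedding_matrix != 0) > 0)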

##### Convolutional Neural Network #####

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = FALSE so as to keep the embeddings fixed
num_words <- min(max_num_words, length(word_index) + 1)

embedding_layer <- keras::layer_embedding(
  input_dim = num_words,
  output_dim = embedding_dim,
  weights = list(embedding_matrix), 
  input_length = max_sequence_length,
  trainable = FALSE
)

# train a 1D convnet with global maxpooling
sequence_input <- layer_input(shape = list(max_sequence_length), dtype='int32')

preds <- sequence_input %>%
  embedding_layer %>% 
  layer_conv_1d(filters = 128, kernel_size = 1, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 5) %>% 
  layer_conv_1d(filters = 128, kernel_size = 1, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 5) %>% 
  layer_conv_1d(filters = 128, kernel_size = 1, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 2) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dense(units = 4, activation = 'softmax')

model <- keras_model(sequence_input, preds)
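
# optional: print the layer shapes to check that the pooling steps divide the
# sequence length cleanly (70 -> 14 -> 2 -> 1 before the flatten)
summary(model)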

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = c('acc')
)

model %>% keras::fit(
  x_train,
  trainLabels,
  batch_size = 1024,
  epochs = 20,
  validation_split = validation_split   # 0.3, set in the parameters above
)

Now, here is the problem I run into: I can't use the trained network to predict on the unseen test data:

# Predict the classes for the test data
classes <- model %>% predict_classes(x_test, batch_size = 128)

I get this error: 
Error in py_get_attr_impl(x, name, silent) : 
  AttributeError: 'Model' object has no attribute 'predict_classes'
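
From the error I suspect that predict_classes() only exists for sequential models (keras_model_sequential()), not for the functional keras_model() used above. Would something like the following be the correct replacement? (untested sketch: predict() returns the softmax probabilities, so the row-wise which.max, minus 1, should give the 0-based class index)

# workaround attempt: predict the class probabilities, then take the
# row-wise argmax (- 1 to get back to 0-based class indices)
probs <- model %>% predict(x_test, batch_size = 128)
classes <- apply(probs, 1, which.max) - 1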

Afterwards, I'd proceed like this:

# Confusion matrix (the predicted classes are 0-3 while y_test is coded 1-4,
# hence the + 1)
table(y_test, classes + 1)

# Evaluate on test data and labels
score <- model %>% evaluate(x_test, testLabels, batch_size = 128)

# Print the score
print(score)

For now, the actual accuracy doesn't matter, since this is only a small sample of my dataset.

I know this is a long one, but ANY help is greatly appreciated.

0 Answers