R lime package with text data

Time: 2018-06-08 01:26:54

Tags: r nlp tokenize

I am exploring the use of R lime to explain the predictions of a black-box model on a text dataset, and came across this example: https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html

I am testing it on a restaurant reviews dataset, but found that some of the plots produced by plot_features do not print all of the features. I would like to know if anyone can offer advice or insight on why this happens, or recommend a different package. Any help is greatly appreciated, since there is not much online about working with R lime. Thanks!

Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing

# Importing the dataset
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)

# Cleaning the texts
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

# Creating the Bag of Words model
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked

# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

library(caret)
model <- train(Liked~., data=training_set, method="xgbTree")

######
#LIME#
######
library(lime)
explainer <- lime(training_set, model)
explanation <- explain(test_set[1:4,], explainer, n_labels = 1, n_features = 5)
plot_features(explanation)

My undesired output: https://www.dropbox.com/s/pf9dq0kba0d5flt/Udemy_NLP_Lime.jpeg?dl=0

What I would like (from a different dataset): https://www.dropbox.com/s/e1472i4yw1owmlc/DMT_A5_lime.jpeg?dl=0

1 Answer:

Answer 0 (score: 1):

I could not open the links you provided for the dataset and the output. However, I used the https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html link you provided: I used text2vec, as shown in that vignette, and the xgboost package for classification, and it works for me. To display more features, you may need to increase the value of n_features in the explain function; see https://www.rdocumentation.org/packages/lime/versions/0.4.0/topics/explain and the short sketch after the code below.

library(lime)
library(xgboost)  # the classifier
library(text2vec) # used to build the BoW matrix

# load data
data(train_sentences, package = "lime")  # from lime 
data(test_sentences, package = "lime")   # from lime

# Tokenize data
get_matrix <- function(text) {
  it <- text2vec::itoken(text, progressbar = FALSE)

  # Use the following lines instead if you want to prune the vocabulary:
  # vocab <- create_vocabulary(it, c(1L, 1L)) %>%
  #   prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
  # vectorizer <- vocab_vectorizer(vocab)

  # hash_vectorizer offers no option to prune the vocabulary, but it is very fast for big data
  vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 1L))
  text2vec::create_dtm(it, vectorizer = vectorizer)
}

# BoW matrix generation
# features should be the same for both dtm_train and dtm_test 
dtm_train <- get_matrix(train_sentences$text)
dtm_test  <- get_matrix(test_sentences$text) 

# xgboost for classification
param <- list(max_depth = 7,
              eta = 0.1,
              objective = "binary:logistic",
              eval_metric = "error",
              nthread = 1)

xgb_model <- xgboost::xgb.train(
  param, 
  xgb.DMatrix(dtm_train, label = train_sentences$class.text == "OWNX"),
  nrounds = 100 
)

# prediction
predictions <- predict(xgb_model, dtm_test) > 0.5
test_labels <- test_sentences$class.text == "OWNX"

# Accuracy
print(mean(predictions == test_labels))

# what are the most important words for the predictions.
n_features <- 5 # number of features to display
sentence_to_explain <- head(test_sentences[test_labels,]$text, 6)
explainer <- lime::lime(sentence_to_explain, model = xgb_model,
                        preprocess = get_matrix)
explanation <- lime::explain(sentence_to_explain, explainer, n_labels = 1,
                             n_features = n_features)

# inspect the key columns of the explanation data frame
explanation[, 2:9]

# plot
lime::plot_features(explanation)
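
If plot_features still does not show as many terms as you expect, a minimal follow-up sketch (reusing the explainer and sentence_to_explain objects defined above) is simply to ask explain for more features per case:

# Request up to 10 features per case; lime can only return terms that
# actually occur in a document, so very sparse documents may still yield fewer.
explanation_10 <- lime::explain(sentence_to_explain, explainer,
                                n_labels = 1, n_features = 10)
lime::plot_features(explanation_10)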

In your code, the following line creates NAs when it is applied to the train_sentences dataset. Please check this line:

dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

Removing the levels argument or changing the levels to labels both worked for me.
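
For example, either of these two variants should avoid the NAs (a minimal sketch against the questioner's dataset; the label names are only illustrative):

# Option 1: drop the explicit levels and let factor() infer them from the data
dataset$Liked = factor(dataset$Liked)

# Option 2: keep the 0/1 levels but map them to labels (label names are illustrative)
dataset$Liked = factor(dataset$Liked, levels = c(0, 1),
                       labels = c("Disliked", "Liked"))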

Please check your data structure and make sure that these NAs have not left you with a zero matrix, and that the data is not too sparse; that can also cause problems, since the top n features cannot then be found.
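
A quick way to check this on the bag-of-words data frame from the question (a sketch, assuming the dataset object built above):

# How many NAs did the factor conversion introduce?
sum(is.na(dataset$Liked))

# Inspect the document-term part of the data for empty rows/columns
term_cols = setdiff(names(dataset), "Liked")
sum(rowSums(dataset[, term_cols]) == 0)    # reviews with no retained terms
sum(colSums(dataset[, term_cols]) == 0)    # terms that never occur
mean(as.matrix(dataset[, term_cols]) == 0) # overall sparsity of the matrix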