R - How to tune a text classifier created using RTextTools

Date: 2017-07-28 20:13:48

Tags: r machine-learning svm





I know that the e1071 library (which RTextTools is built on) contains the tune.svm function, which returns the cost and gamma values that produce the best results. The problem is that the data parameter of tune.svm expects a data frame, but because I am building a text classifier, what I feed into the SVM is not a simple data frame but a document-term matrix.

# Packages
## Install
install.packages(c('e1071', 'RTextTools'))
## Import
library(e1071)
library(RTextTools)

data.train <- data.frame("content" = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
                                       "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
                                       "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."),
                         "label" = c("yes", "yes", "no"))

data.test <- data.frame("content" = c("It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.",
                                      "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English.",
                                      "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."),
                        "label" = c("no", "yes", "yes"))

# Process training dataset
data.train.dtm <- create_matrix(data.train$content, language = "english", weighting = tm::weightTfIdf,
                                removePunctuation = TRUE, removeNumbers = TRUE, removeSparseTerms = 0,
                                removeStopwords = TRUE, stemWords = TRUE, stripWhitespace = TRUE, toLower = TRUE)
data.train.container <- create_container(data.train.dtm, data.train$label,
                                         trainSize = 1:nrow(data.train), virgin = FALSE)

# Create linear SVM model
# (note: 1^-2 evaluates to 1, and gamma is not used by the linear kernel)
model.linear <- train_model(data.train.container, "SVM", kernel = "linear", cost = 10, gamma = 1^-2)

# Process testing dataset
data.test.dtm <- create_matrix(data.test$content, originalMatrix = data.train.dtm)
data.test.container <- create_container(data.test.dtm, labels = rep(0, nrow(data.test)),
                                        testSize = 1:nrow(data.test), virgin = FALSE)

# Classify testing dataset
model.linear.results <- classify_model(data.test.container, model.linear)
model.linear.results.table <- table(Predicted = model.linear.results$SVM_LABEL, Actual = data.test$label)
model.linear.results.table
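For reference, the data frame requirement applies to the formula interface of tune.svm. A minimal sketch of that route follows, assuming the document-term matrix is small enough to be converted to a dense data frame. The toy matrix, term names, labels, and parameter grids are made up for illustration and are not taken from the code above.

```r
library(e1071)

set.seed(42)

# Toy stand-in for a document-term matrix: 20 "documents" x 4 "terms".
# Term names and weights are hypothetical.
n <- 20
dtm <- matrix(runif(n * 4), nrow = n,
              dimnames = list(NULL, c("lorem", "ipsum", "dummi", "text")))
labels <- factor(ifelse(dtm[, "lorem"] > dtm[, "ipsum"], "yes", "no"))

# Densify into a data frame so the formula/data interface can be used.
train.df <- data.frame(dtm, label = labels)

# Grid search over gamma and cost with 5-fold cross-validation.
tuned <- tune.svm(label ~ ., data = train.df,
                  gamma = 10^(-3:-1), cost = 10^(0:2),
                  tunecontrol = tune.control(cross = 5))

# One-row data frame holding the best gamma/cost combination found.
tuned$best.parameters
```

Densifying a real document-term matrix can be memory-hungry, so this is only practical for small corpora.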




1 answer:

Answer 0 (score: 1)


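  • cross=0 (do not perform cross-validation on the training data)
  • cost=100 (cost of constraint violation)
  • probability=TRUE (the model should allow probability predictions)
  • kernel="radial" (radial kernel for SVM training)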
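The four settings listed above correspond directly to arguments of e1071::svm. The sketch below passes them explicitly, using a toy numeric matrix as a stand-in for the container's training matrix. The data and variable names are hypothetical.

```r
library(e1071)

set.seed(1)

# Toy numeric matrix standing in for a document-term matrix (values are made up).
x <- matrix(rnorm(40 * 3), nrow = 40)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "yes", "no"))

fit <- svm(x = x, y = y,
           cross = 0,            # no cross-validation on the training data
           cost = 100,           # cost of constraint violation
           probability = TRUE,   # fit the model so it can return probabilities
           kernel = "radial")    # radial basis kernel

# Class probabilities are attached to the predictions as an attribute.
pred <- predict(fit, x, probability = TRUE)
head(attr(pred, "probabilities"))
```

Changing cost, kernel, or gamma in a call like this is what a tuning run automates.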



model.tuned <- tune.svm(x = data.train.container@training_matrix,
                        y = data.train.container@training_codes,
                        gamma = 10^(-6:-1),
                        cost = 10^(-1:1),
                        # fill in any other SVM params as needed here