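I am trying to create a text classifier using the RTextTools library in R. The training and testing data frames have the same format: two columns, where the first is the text and the second is the label.
Here is a minimal reproducible example of my program so far (with substitute data):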
# Packages
## Install
install.packages(c('e1071', 'RTextTools'))
## Import
library(e1071)
library(RTextTools)
data.train <- data.frame("content" = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry.", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."), "label" = c("yes", "yes", "no"))
data.test <- data.frame("content" = c("It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.", "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English.", "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."), "label" = c("no", "yes", "yes"))
# Process training dataset
data.train.dtm <- create_matrix(data.train$content, language = "english", weighting = tm::weightTfIdf, removePunctuation = TRUE, removeNumbers = TRUE, removeSparseTerms = 0, removeStopwords = TRUE, stemWords = TRUE, stripWhitespace = TRUE, toLower = TRUE)
data.train.container <- create_container(data.train.dtm, data.train$label, trainSize = 1:nrow(data.train), virgin = FALSE)
# Create linear SVM model
# note: 1^-2 evaluates to 1, and gamma is not used by the linear kernel
model.linear <- train_model(data.train.container, "SVM", kernel = "linear", cost = 10, gamma = 1^-2)
# Process testing dataset
data.test.dtm <- create_matrix(data.test$content, originalMatrix = data.train.dtm)
data.test.container <- create_container(data.test.dtm, labels = rep(0, nrow(data.test)), testSize = 1:nrow(data.test), virgin = FALSE)
# Classify testing dataset
model.linear.results <- classify_model(data.test.container, model.linear)
model.linear.results.table <- table(Predicted = model.linear.results$SVM_LABEL, Actual = data.test$label)
model.linear.results.table
My code works so far and produces a table comparing the predicted values with the actual values. The results are very inaccurate, and it is clear to me that the model needs tuning.
I know that the e1071 library (which RTextTools is built on) contains the tune.svm function, which returns the cost and gamma values that produce the best results. The problem with using it is that the data parameter of tune.svm expects a data frame, but because I am building a text classifier, what I feed into the SVM is not a simple data frame but a document-term matrix.
I tried, to no avail, to read the DTM in as a data frame like this:
data
I am completely lost, and any insight would be appreciated.
Answer 0 (score: 1)
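You can look at the code of train_model (press F2 in RStudio) to see how it calls svm() with the container (data.train.container in your case). By default, train_model passes the following arguments to svm():
cross=0 (do not perform cross-validation on the training data)
cost=100 (the cost of constraint violation)
probability=TRUE (the model should allow probability predictions)
kernel="radial" (radial kernel for SVM training)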
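As a rough illustration of the defaults listed above, the call is presumably similar to the direct svm() call below. This is a sketch reconstructed from that description, not copied from the RTextTools source. The built-in iris data is an assumption that stands in for the container's training_matrix and training_codes, and fit.default is a name made up for this example.

```r
library(e1071)

# Stand-in data (assumption): a numeric feature matrix and factor labels,
# playing the role of the container's training_matrix and training_codes.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# The four defaults described above, passed straight to svm().
fit.default <- svm(x = x, y = y,
                   cross = 0,           # no cross-validation on the training data
                   cost = 100,          # cost of constraint violation
                   probability = TRUE,  # allow probability predictions
                   kernel = "radial")   # radial kernel

summary(fit.default)
```

Changing cost or kernel here mirrors what happens when you pass those arguments to train_model.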
To actually answer your question: the container returned by create_container() has the slots training_matrix and training_codes, which you can use as follows:
model.tuned <- tune.svm(x = data.train.container@training_matrix,
                        y = data.train.container@training_codes,
                        gamma = 10^(-6:-1),
                        cost = 10^(-1:1)
                        # add any other SVM parameters here (put a comma after cost first)
)
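Once tuning has run, the returned object can be inspected for the selected parameters and the refitted model. The sketch below shows this on the built-in iris data rather than on the RTextTools container, so it runs with e1071 alone. The names tuned and best.fit, the smaller parameter grid, and the 5-fold setting are assumptions chosen for illustration.

```r
library(e1071)

set.seed(1)  # cross-validation folds are sampled randomly

# Stand-in data (assumption): numeric feature matrix and factor labels.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Grid search over gamma and cost with 5-fold cross-validation.
tuned <- tune.svm(x = x, y = y,
                  gamma = 10^(-3:-1),
                  cost = 10^(0:1),
                  tunecontrol = tune.control(cross = 5))

tuned$best.parameters    # one-row data frame holding the chosen gamma and cost
tuned$best.performance   # cross-validated error of that combination

# The model refitted on all the data with the best parameters.
best.fit <- tuned$best.model
table(Predicted = predict(best.fit, x), Actual = y)
```

The same fields should be available on model.tuned above, so model.tuned$best.parameters would give the gamma and cost to try in the classifier.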