Question

I am testing a new dataset in CSV format. First, I built the trained system using:

matrix <- create_matrix(train["Title"], language="english", weighting=tm::weightTfIdf)
container <- create_container(matrix,train$TagId,trainSize=1:x, testSize=(x+1):nrow(train),virgin=FALSE)

# create maxent model using SVM
maxent_model <- train_models(container,algorithms=c("SVM"))
maxent_results <- classify_models(container,maxent_model)

# test the model on test data
maxenttestData = train[(x+1):nrow(train),]
maxenttestData = data.frame(maxenttestData, maxent_results)
write.csv(maxenttestData, "MAXENT.csv", row.names = FALSE)

Then I test the system with the new dataset:

new = read_csv("new.csv")
new$Title = toupper(new$Title)
new$Title = gsub("[<].*[>]", "", as.character(new$Title))
new$Title = gsub("&amp", "", new$Title)
new$Title = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", new$Title)
new$Title = gsub("@\\w+", "", new$Title)
new$Title = gsub("[[:punct:]]", "", new$Title)
new$Title = gsub("[[:digit:]]", "", new$Title)
new$Title = gsub("http\\w+", "", new$Title)
new$Title = gsub("[ \t]{2,}", "", new$Title)
new$Title = gsub("^\\s+|\\s+$", "", new$Title)
#write.csv(new, "preprocess_new.csv", row.names = FALSE)
matrix <- create_matrix(new["Title"], language="english", weighting=tm::weightTfIdf)
container <- create_container(matrix, new$TagId, trainSize=NULL, testSize=1:nrow(new), virgin=FALSE)
maxent_results <- classify_models(container,maxent_model)
write.csv(maxent_results, "MAXENT_res.csv", row.names = FALSE)

But it shows this error:

> maxent_results <- classify_models(container,maxent_model)
Error in predict.svm(model, container@classification_matrix, probability = TRUE,  :
  test data does not match model !

Answer 1

Look at your first gsub and at the result of the following code:

aaa <- "<html><title>X</title>all webpage content is between < and >  </html>"
aaa <- gsub("[<].*[>]", "", aaa)
aaa
[1] ""

After this step, if the text is a block of HTML, the whole string is deleted and there is nothing left to classify.
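A minimal sketch of one possible fix for that step, assuming the intent of the first gsub is to strip individual HTML tags rather than everything between the first "<" and the last ">". The sample title is hypothetical and not taken from the asker's data. The pattern "<[^>]*>" is a swapped-in alternative, not something the question or this answer confirms as the intended one.

```r
# Hypothetical sample title; not taken from the asker's data.
title <- "<b>How to</b> parse <i>CSV</i> in R"

# Greedy pattern from the question: ".*" runs from the first "<"
# to the last ">", so the text between the tags is deleted as well.
greedy <- gsub("[<].*[>]", "", title)

# Tag-by-tag pattern (an assumption about the intent): "[^>]*" cannot
# cross a ">", so each tag is matched and removed on its own.
per_tag <- gsub("<[^>]*>", "", title)

print(greedy)   # " in R"
print(per_tag)  # "How to parse CSV in R"
```

This only addresses the text lost during cleaning. Whether it is enough to clear the predict.svm error depends on the rest of the pipeline, which the answer does not cover.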

classify_models does not work when I test a new dataset in R
