Question

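I want to do SVM text classification along two different dimensions, using the same data: one is binary, the other is multi-class (3 classes) and imbalanced. I preprocessed the test and training data (stemming, stopword removal, etc.) so that I have the data as in a DTM, but in a data frame with my classification as an additional column (as a factor; the remaining cells are numeric). Now I want to tune the model to find the recommended best C parameter.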

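However, when I run tune() or tune.svm(), the output I get from summary(tune_level) is strange. No best.parameter is given (there is just a blank where it should be); it only shows which type of validation was used and best.performance (below 10% for both variables!). When I plot it, I get a straight line, with every value of C sitting at the "best performance" value. I'm not sure what I am doing wrong.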

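I have tried different values for C, and I have also scaled the data to [-1, 1], but I still get the same result. I also tried keeping the classification as a standalone factor, separate from the data frame, but that made no difference either.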

My weight vector

wts_ideo <- 1000/table(train_test_ideo)

        1         2         3 
0.7662835 6.8027211 8.4033613

Tuning code

tunepara_ideo <- tune.svm( ideo_tt~. , data = train_ideo, kernel="linear",
  cost=10^(-1:2), class.weights=wts_ideo, tunecontrol = tune.control(cross = 5))

tunepara_level <- tune.svm( level_tt~. , data = train_level, 
  cost=5^(-1:2), tunecontrol = tune.control(cross = 5))

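I suspect something is wrong with the data format, but I don't know what it could be. At first I thought it was because of class.weights (or because the classification is not binary), but my other variable, which is binary and does not use class.weights, doesn't seem to work either, so I no longer think that is the cause.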

Here is a small excerpt of the data I am using

> dput(a)
structure(list(support = structure(c(1.61435223135001, 1.61435223135001, 
-0.348610118070166, -0.348610118070166, -0.348610118070166), .Dim = c(5L, 
1L)), who = structure(c(-0.121854107613728, -0.121854107613728, 
-0.121854107613728, -0.121854107613728, 8.20131124287177), .Dim = c(5L, 
1L)), will = structure(c(-0.247064839669383, -0.247064839669383, 
-0.247064839669383, 1.8065799387465, -0.247064839669383), .Dim = c(5L, 
1L)), promot = structure(c(-0.206975612356537, -0.206975612356537, 
2.86055917077667, -0.206975612356537, -0.206975612356537), .Dim = c(5L, 
1L)), child = structure(c(-0.260431623180936, 3.03906902211947, 
-0.260431623180936, -0.260431623180936, -0.260431623180936), .Dim = c(5L, 
1L)), surviv = structure(c(-0.175707952396644, -0.175707952396644, 
-0.175707952396644, -0.175707952396644, 5.45770415403452), .Dim = c(5L, 
1L)), beyond = structure(c(-0.0981527714501276, -0.0981527714501276, 
10.1817141584266, -0.0981527714501276, -0.0981527714501276), .Dim = c(5L, 
1L)), die = structure(c(-0.136853020267148, -0.136853020267148, 
-0.136853020267148, 6.18656153384136, -0.136853020267148), .Dim = c(5L, 
1L)), kill = structure(c(-0.367191144103825, 2.02640755874725, 
-0.367191144103825, 2.02640755874725, -0.367191144103825), .Dim = c(5L, 
1L)), somehow = structure(c(-0.161586150207654, 6.18470989919798, 
-0.161586150207654, -0.161586150207654, -0.161586150207654), .Dim = c(5L, 
1L))), row.names = c(1095L, 1239L, 1140L, 1517L, 1112L), class = "data.frame")

Thanks for your help!

Answer 1

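I think "best performance" here is a case where lower is better: for a factor response, tune() reports the cross-validated classification error by default, so a value below 10% would correspond to more than 90% accuracy. Take the parameters that give the minimum, then rerun svm() with them to check the accuracy.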
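As a rough illustration of that workflow, here is a minimal sketch. It uses the built-in iris data as a stand-in for the document-term data frame (an assumption, since the full data is not shown), and the object names tuned, fit and acc are only illustrative. It shows where the selected cost and the error value live in the tune object, and how to refit with the selected cost.

```r
library(e1071)

set.seed(1)  # cross-validation folds are random; fix the seed for repeatability

# Assumption: iris replaces the asker's train_ideo / train_level data frames.
tuned <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                  cost = 10^(-1:2),
                  tunecontrol = tune.control(cross = 5))

tuned$best.parameters   # data frame holding the selected cost
tuned$best.performance  # cross-validated classification error (lower is better)
tuned$performances      # error and dispersion for every cost value tried

# Refit with the selected cost and check the accuracy.
fit <- svm(Species ~ ., data = iris, kernel = "linear",
           cost = tuned$best.parameters$cost)
acc <- mean(predict(fit, iris) == iris$Species)  # training accuracy
acc
```

Note that acc is measured on the training data here, so it is optimistic; a held-out test set would give a fairer estimate. If tuned$performances shows the same error for every cost, the flat line in the plot is simply reflecting that all candidates performed equally in cross-validation.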

e1071 tune.svm: best performance is extremely low and the result does not show the best parameters