Question

我有很好的机器学习训练集（只有字符串属性）。

e.g。

@relation training_rel

@attribute class {politics,sports}
@attribute text string

@data
politics,'some text about politics over here'
... // a lot of other training instances of class politics
sports,'and now some sports over here'
... // a lot of other training instances of class sports

好的，这是我的训练集，当然只是一个例子......现在我想建立一个分类器（NaiveBayes）。这完全没问题。我知道大多数分类器都无法处理文本，所以我必须过滤我的数据。我为此使用StringToWordVector。

我发现的所有Web上的示例都定义了测试实例，并且还有类值（http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/）但为什么？我的意思是我不知道我的文字是属于政治还是体育，这就是我使用分类器来了解这一点的原因......我是否理解错误？

Answer 1

测试数据集中的标签用于分类器评估目的。您可以根据训练数据集训练模型，并评估测试数据集上的模型性能。如果没有标签，则无法评估测试数据。

在实际使用时间内，您不会知道实际标签。因此，让测试数据代表真实数据集非常重要。否则您的评估结果没有价值。

为什么WEKA-TestSets必须具有类属性？

1 个答案: